8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs #25609

limingliu-ampere · 2025-06-03T07:14:03Z

This PR enables crypto pmull for the CRC32/CRC32C intrinsics on Ampere CPUs. The existing option UseCryptoPmullForCRC32 can already enable crypto pmull, but simply turning it on for Ampere CPUs causes the following problems.

1. There are regressions (-14% ~ -8%) on Ampere1 when the length is 64. For lengths <= 128, both kernel_crc32_using_crc32 and kernel_crc32_using_crypto_pmull use the loop labeled CRC_by32_loop, but their implementations differ slightly, and the loop in kernel_crc32_using_crc32 is better at hiding latency on Ampere1. So this PR copies the loop from kernel_crc32_using_crc32 into kernel_crc32_using_crypto_pmull, and does the same for the CRC32C intrinsic.
2. The intrinsics only use crypto pmull when the length is greater than 383, although the loop in kernel_crc32_common_fold_using_crypto_pmull appears able to handle 256 bytes. If it handles 256 bytes on Ampere1, the improvement can be as high as 110% compared with kernel_crc32_using_crc32/kernel_crc32c_using_crc32c. However, there are regressions (about -6%) on Neoverse V1 when the length is 256. So this PR introduces a new option named CryptoPmullForCRC32LowLimit. It defaults to 256, since the code can handle 256 bytes, and it is set to 384 for V1/V2 to keep the old behavior on those platforms.
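The threshold behavior described in the second problem can be sketched in plain C++. This is only an illustrative model, not the HotSpot code: the names `Cpu`, `Path`, `low_limit_for` and `choose_path` are hypothetical, and the sketch assumes UseCryptoPmullForCRC32 is enabled. It shows the crypto pmull path being taken once the length reaches the low limit, with 256 as the default and 384 on V1/V2.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; none of these exist in HotSpot.
enum class Cpu { Ampere1, NeoverseV1, NeoverseV2, Other };
enum class Path { CrcInstructions, CryptoPmull };

// Low limit as stated in the PR description: 256 by default, 384 on V1/V2.
constexpr uint32_t low_limit_for(Cpu cpu) {
  return (cpu == Cpu::NeoverseV1 || cpu == Cpu::NeoverseV2) ? 384u : 256u;
}

// Inputs shorter than the low limit stay on the CRC32-instruction loops;
// inputs at or above it take the crypto pmull path.
constexpr Path choose_path(uint32_t len, uint32_t low_limit) {
  return len >= low_limit ? Path::CryptoPmull : Path::CrcInstructions;
}

// A 256-byte input takes the pmull path with the default limit ...
static_assert(choose_path(256, low_limit_for(Cpu::Ampere1)) == Path::CryptoPmull, "");
// ... but keeps the old behavior where the limit stays at 384.
static_assert(choose_path(256, low_limit_for(Cpu::NeoverseV1)) == Path::CrcInstructions, "");
static_assert(choose_path(384, low_limit_for(Cpu::NeoverseV1)) == Path::CryptoPmull, "");
```

The `static_assert` lines double as a usage example: the same 256-byte input is routed differently depending on the platform's limit.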

The performance regressions and improvements were measured with the following microbenchmarks:
org.openjdk.bench.java.util.TestCRC32.testCRC32Update
org.openjdk.bench.java.util.TestCRC32C.testCRC32CUpdate

Ran the following JTReg tests on Ampere1 and did not find problems:
test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java
test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs (Enhancement - P4)

Reviewers

Andrew Haley (@theRealAph - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25609/head:pull/25609
$ git checkout pull/25609

Update a local copy of the PR:
$ git checkout pull/25609
$ git pull https://git.openjdk.org/jdk.git pull/25609/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25609

View PR using the GUI difftool:
$ git pr show -t 25609

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25609.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-06-03T07:15:04Z

👋 Welcome back lliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-06-03T07:15:48Z

@limingliu-ampere This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs

Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 350 new commits pushed to the master branch:

620df7e: 8359801: RISC-V: Simplify Interpreter::profile_taken_branch
6b43939: 8359270: C2: alignment check should consider base offset when emitting arraycopy runtime call
81985d4: 8358526: Clarify behavior of java.awt.HeadlessException constructed with no-args
... and 347 more: https://git.openjdk.org/jdk/compare/04e0fe00abcf1d7919a50e0c9dd44ce2856984ea...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @theRealAph) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk · 2025-06-03T07:16:28Z

@limingliu-ampere The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-06-03T07:19:11Z

Webrevs

theRealAph · 2025-06-03T08:35:22Z

src/hotspot/cpu/aarch64/globals_aarch64.hpp

+  product(intx, CryptoPmullForCRC32LowLimit, 256,                       \
+          "Minimum size in bytes when Crypto PMULL will be used."       \
+          "Value must be a multiple of 128.")                           \
+          range(256, max_jint)                                          \


This shouldn't be a general product flag.

Suggested change

-  product(intx, CryptoPmullForCRC32LowLimit, 256,                       \
-          "Minimum size in bytes when Crypto PMULL will be used."       \
-          "Value must be a multiple of 128.")                           \
-          range(256, max_jint)                                          \
+  product(intx, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC,           \
+          "Minimum size in bytes when Crypto PMULL will be used."       \
+          "Value must be a multiple of 128.")                           \
+          range(256, max_jint)                                          \

eme64

@limingliu-ampere Thanks for working on this! 😊

Generally looks reasonable to me as a non-expert in crypto intrinsics. But we definitely need an expert to approve this in the end. I have a few comments below.

Also: it would be nice to have a sanity test where you use that new flag. It could also be an additional run in an existing test (that's probably even better). You may want to run it with a few different values, including non-multiple of 128 just to sanity check the alignment correction as well. I don't know how much runtime that would add, so that should be checked before going too crazy. Having different values for the flag helps us to simulate the behavior of other hardware for example, and that can be quite useful in general. What do you think?

eme64 · 2025-06-04T08:25:44Z

src/hotspot/cpu/aarch64/globals_aarch64.hpp

+  product(intx, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC,           \
+          "Minimum size in bytes when Crypto PMULL will be used."       \
+          "Value must be a multiple of 128.")                           \
+          range(256, max_jint)                                          \


Is it sane to have negative values? If not, use uintx... or maybe even just uint?

eme64 · 2025-06-04T08:27:10Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

@@ -4332,7 +4332,7 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
    Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
    assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

-    subs(tmp0, len, 384);
+    subs(tmp0, len, CryptoPmullForCRC32LowLimit);


Would it make sense to have another alignment sanity check here? It would be both helpful to make sure nobody later breaks your assumption, and could also be helpful for the reader to see the 128 alignment immediately.

I think the alignment does not affect the correctness here, but the value should be >= 256. So I added the corresponding assertion above.

theRealAph · 2025-06-04T08:42:58Z

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

    crc32x(crc, crc, tmp2);
-    subs(len, len, 32);


What is the point of these changes?

To be more precise: converting these adjustments to post-increment operations isn't obviously an improvement on AArch64 generally. How does it help?

According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.

According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.

Hmm, but it's code in a rather odd style in shared code. And from what I see, the intrinsic is only 22% of the runtime (for 128 bytes) anyway, and you're making the code larger. I certainly don't want to see this sort of thing proliferating in the intrinsics.

In general, it's up to CPU designers to make simple, straightforward code work well.

How important is this?

On the other hand this code already exists in CRC32C, so it's simply unifying the two routines. OK, I won't object.

you're making the code larger.

I don't think this makes the code larger.

How important is this?

As I mentioned in problem 1, this causes a regression (~-14%) on Ampere1 when handling 64 bytes. No obvious effects in other cases though.

so it's simply unifying the two routines.

Yes.

limingliu-ampere · 2025-06-11T03:38:07Z

/integrate

openjdk · 2025-06-11T03:39:10Z

@limingliu-ampere
Your change (at version df9f920) is now ready to be sponsored by a Committer.

theRealAph · 2025-06-11T08:39:44Z

Generally looks reasonable to me as a non-expert in crypto intrinsics. But we definitely need an expert to approve this in the end. I have a few comments below.

I'm happy enough, and it also seems to bring improvements on Apple silicon.

However, the title is misleading: while this PR may have started as something purely Ampere-specific, it no longer is. Something like "AArch64: CRC32/CRC32 enhancements" is perhaps rather vague, but at least it's true.

limingliu-ampere · 2025-06-20T07:04:05Z

Ping.

theRealAph · 2025-06-20T08:26:40Z

Ping.

When you reply to my last point.

limingliu-ampere · 2025-06-20T08:41:56Z

Ping.

When you reply to my last point.

Does this mean changing the title? But I don't see how the patch helps Apple, since it does not affect the default behavior on Apple. There would be changes when enabling UseCryptoPmullForCRC32 for 383 bytes or smaller. So, are the improvements from this?

theRealAph · 2025-06-20T08:51:21Z

Ping.

When you reply to my last point.

Does this mean changing the title? But I don't see how the patch helps Apple, since it does not affect the default behavior on Apple. There would be changes when enabling UseCryptoPmullForCRC32 for 383 bytes or smaller. So, are the improvements from this?

This PR does not only enable crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU, it also improves the algorithm for some cases of short arrays. I made a suggestion for a title that accurately describes this PR. Feel free to write your own accurate title, or use mine.

eme64

Looks like this is progressing nicely here :)

I have 2 comments below. Once you addressed them, I'll run some internal testing, and then I can approve it :)

eme64 · 2025-06-23T05:48:48Z

src/hotspot/cpu/aarch64/globals_aarch64.hpp

@@ -89,6 +89,10 @@ define_pd_global(intx, InlineSmallCode,          1000);
          "Use CRC32 instructions for CRC32 computation")               \
  product(bool, UseCryptoPmullForCRC32, false,                          \
          "Use Crypto PMULL instructions for CRC32 computation")        \
+  product(uint, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC,           \


Can you please add a test that uses this flag, and sets it to some selected values, and maybe even a random value?

Is there already an IR test that checks for the presence of the crypto pmull? That could be good to ensure it occurs as expected and only when expected :)

There are test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java and TestCRC32C.java. They cover various lengths for the input, and test the intrinsics with the default value of the flag. But they do not cover different values of the flag, which I think could be covered by VM_OPTIONS. I feel that it is not suitable to add the flag in the @run tag, since it is aarch64 specific while the test is generic.

eme64 · 2025-06-23T05:50:57Z

src/hotspot/cpu/aarch64/vm_version_aarch64.cpp

+  if (!(is_aligned(CryptoPmullForCRC32LowLimit, 128))) {
+    warning("CryptoPmullForCRC32LowLimit must be a multiple of 128");
+    CryptoPmullForCRC32LowLimit = align_down(CryptoPmullForCRC32LowLimit, 128);
+  }


Can you describe somewhere why it has to be a multiple of 128? Imagine someone comes across this later, and wonders if that is just some strange implementation limitation or something more fundamental, or something very subtle.

There are 4 kinds of loops, labeled CRC_by128_loop, CRC_by32_loop, CRC_by4_loop and CRC_by1_loop. If the flag is 266, which is 128x2+10, then for 265 bytes of input there are 256 bytes that are handled by CRC_by32_loop, while for 266 bytes of input the corresponding 256 bytes are handled by CRC_by128_loop, and I think this causes an inconsistency. If CRC_by32_loop handles 256 bytes better than CRC_by128_loop on a platform, it should be used for 266 bytes as well.
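The 265/266-byte example above can be checked with a small, simplified model. This is a sketch under stated assumptions, not the intrinsic itself: `kBlock`, `align_down_to_block` and `bytes_in_by128_loop` are hypothetical names, and the model assumes that once the length reaches the limit, the 128-byte loop takes every whole 128-byte block of the input.

```cpp
#include <cstdint>

// Hypothetical model: bytes consumed per CRC_by128_loop iteration.
constexpr uint32_t kBlock = 128;

// Round a limit down to a multiple of 128, in the spirit of the
// align_down correction applied to the flag in the PR.
constexpr uint32_t align_down_to_block(uint32_t limit) {
  return limit - limit % kBlock;
}

// Simplified assumption: below the limit the 128-byte loop is not used;
// at or above it, it takes all whole 128-byte blocks of the input.
constexpr uint32_t bytes_in_by128_loop(uint32_t len, uint32_t low_limit) {
  return len >= low_limit ? (len / kBlock) * kBlock : 0u;
}

// With an unaligned limit of 266, one extra input byte moves the same
// 256 bytes from the 32-byte loop to the 128-byte loop.
static_assert(bytes_in_by128_loop(265, 266) == 0, "");
static_assert(bytes_in_by128_loop(266, 266) == 256, "");

// After aligning the limit down to 256, both inputs are treated alike.
static_assert(align_down_to_block(266) == 256, "");
static_assert(bytes_in_by128_loop(265, align_down_to_block(266)) == 256, "");
static_assert(bytes_in_by128_loop(266, align_down_to_block(266)) == 256, "");
```

Under this model, a limit that is a multiple of 128 means the choice of loop for a given 128-byte block never depends on a handful of trailing bytes.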

limingliu-ampere added 2 commits May 28, 2025 23:43

Introduce CryptoPmullForCRC32LowLimit and use pmull for crc32 on Ampere CPU

b1f4e1f

Use the utility functions

8aa9657

openjdk bot changed the title ~~JDK-8358032~~ 8358032: Use crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU Jun 3, 2025

openjdk bot added the rfr Pull request is ready for review label Jun 3, 2025

openjdk bot added the hotspot [email protected] label Jun 3, 2025

theRealAph reviewed Jun 3, 2025

Make it be a diagnostic flag

db926eb

eme64 suggested changes Jun 4, 2025

theRealAph reviewed Jun 4, 2025

limingliu-ampere added 2 commits June 4, 2025 22:46

Use uint for the option and assert it >= 256

9b2bae6

Add the message for the assertions

df9f920

theRealAph approved these changes Jun 9, 2025

openjdk bot added the ready Pull request is ready to be integrated label Jun 9, 2025

openjdk bot added the sponsor Pull request is ready to be sponsored label Jun 11, 2025

openjdk bot removed sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated labels Jun 23, 2025

limingliu-ampere changed the title ~~8358032: Use crypto pmull for CRC32/CRC32C intrinsics on Ampere CPU~~ 8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs Jun 23, 2025

openjdk bot added sponsor Pull request is ready to be sponsored ready Pull request is ready to be integrated labels Jun 23, 2025

eme64 suggested changes Jun 23, 2025
