This repository was archived by the owner on Aug 15, 2019. It is now read-only.

Add packed arithmetic binary operation kernel. (#951) #1423

Merged
merged 13 commits into from
Dec 26, 2018

Conversation

astojilj
Contributor

@astojilj astojilj commented Dec 3, 2018

FEATURE

Description

  • Doesn't support 5-D and 6-D tensors.
  • Supports add (addN), sub, multiply, realDivide, floorDiv, and pow.
  • Broadcasting is supported only when one of the operands is a scalar or 1-D.
  • Added behind the environment flag 'WEBGL_PACK_BINARY_OPERATIONS'.
  • Existing arithmetic tests are configured to run with both non-packed and packed binary operations.
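To illustrate what the flag enables (a CPU sketch only, not the actual WebGL kernel; all names here are illustrative): packed mode stores four adjacent tensor values in one RGBA texel, so an elementwise binary op runs once per texel instead of once per value.

```typescript
// Sketch: emulate a packed elementwise binary op on the CPU.
// A number[4] tuple stands in for one RGBA texel (a vec4 in the shader).
type Vec4 = [number, number, number, number];

// Pack a flat array into vec4 "texels", zero-padding the tail.
function pack(data: number[]): Vec4[] {
  const texels: Vec4[] = [];
  for (let i = 0; i < data.length; i += 4) {
    const at = (j: number) => (j < data.length ? data[j] : 0);
    texels.push([at(i), at(i + 1), at(i + 2), at(i + 3)]);
  }
  return texels;
}

// One "shader invocation" per texel: the op is applied to all four lanes.
function packedBinaryOp(
    a: Vec4[], b: Vec4[], op: (x: number, y: number) => number): Vec4[] {
  return a.map((texel, i) => texel.map((x, lane) => op(x, b[i][lane])) as Vec4);
}

const sum = packedBinaryOp(
    pack([1, 2, 3, 4, 5]), pack([10, 20, 30, 40, 50]), (x, y) => x + y);
console.log(sum);  // [[11, 22, 33, 44], [55, 0, 0, 0]]
```

A 5-element tensor takes two texels instead of five scalar texture reads, which is where the speedup comes from.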

For repository owners only:

Please remember to apply all applicable tags to your pull request.
Tags: FEATURE, BREAKING, BUG, PERF, DEV, DOC, SECURITY

For more info see: https://github.com/tensorflow/tfjs/blob/master/DEVELOPMENT.md



@annxingyuan
Collaborator

Hey @astojilj - thank you so much for this contribution! Let me know when this is ready for review - currently the tests are failing (related to the new TEST_ENV). FYI we've been testing packed ops by adding describeWithFlags blocks and flipping the relevant flags on at the beginning and off after the tests have run. Here's an example: https://github.com/tensorflow/tfjs-core/blob/master/src/ops/batchnorm_test.ts#L22
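The flag-flipping pattern described above can be sketched without tfjs-core (the Env class below is a toy stand-in for tfjs-core's ENV registry, not its real API):

```typescript
// Toy stand-in for tfjs-core's ENV flag registry (illustrative only).
class Env {
  private flags = new Map<string, boolean>();
  get(name: string): boolean { return this.flags.get(name) === true; }
  set(name: string, value: boolean): void { this.flags.set(name, value); }
}
const ENV = new Env();

// The describeWithFlags idea: turn a flag on before a suite runs and
// restore the previous value afterwards, even if the suite throws, so
// every existing test can also exercise the packed code path.
function withFlag<T>(name: string, value: boolean, suite: () => T): T {
  const previous = ENV.get(name);
  ENV.set(name, value);
  try {
    return suite();
  } finally {
    ENV.set(name, previous);
  }
}

const packedDuring = withFlag(
    'WEBGL_PACK_BINARY_OPERATIONS', true,
    () => ENV.get('WEBGL_PACK_BINARY_OPERATIONS'));
console.log(packedDuring, ENV.get('WEBGL_PACK_BINARY_OPERATIONS'));
// true false
```

The try/finally matters: a failing test must not leak the packed flag into later suites.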

astojilj added a commit to astojilj/tfjs-core that referenced this pull request Dec 4, 2018
This is a fix to PR tensorflow#1423 issue - webgl testing environment
is not runnable from yarn node-test.
@astojilj
Contributor Author

astojilj commented Dec 4, 2018

@annxingyuan, PTAL.

> FYI we've been testing packed ops by adding describeWithFlags blocks and flipping the relevant flags on at the beginning and off after the tests have run. Here's an example: https://github.com/tensorflow/tfjs-core/blob/master/src/ops/batchnorm_test.ts#L22

Thanks, I learned from those. In this case, I wanted to run the existing tests with both non-packed and packed operations rather than write new tests for the packed arithmetic ops. The code doing that is in 'arithmetic_test.ts'.

@annxingyuan
Collaborator

Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off? I tried dropping your change into an internal model and the outputs seem to be different.

Thank you!

astojilj added a commit to astojilj/tfjs-core that referenced this pull request Dec 5, 2018
This is a fix to PR tensorflow#1423 issue - webgl testing environment
is not runnable from yarn node-test.
@astojilj
Contributor Author

astojilj commented Dec 5, 2018

> Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off? I tried dropping your change into an internal model and the outputs seem to be different.
>
> Thank you!

After rebasing to the latest master, the tests also started to fail. I fixed that (updated packed sampling to the same approach you used with batchnorm and clip; thanks, it's much simpler now) and the tests pass locally.

@astojilj
Contributor Author

astojilj commented Dec 5, 2018

@annxingyuan PTAL. This should now be ready for review, and it has been verified with the DeepLab v3+ MobileNetV2 model.

> Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off?

I will move the following content to a bug report once I find a way to reproduce it in tests:
When I run DeepLab v3+ MobileNetV2 (the model and inference code are explained here, and I could set up a repo with the code and model if needed) with the master branch (this patch not applied), inference doesn't work when packed batchnorm is enabled:

Works fine with this in code:

ENV.set('WEBGL_PACK', true);
ENV.set('WEBGL_PACK_BATCHNORMALIZATION', false);

Doesn't work (no errors in console but segmentation doesn't work):

ENV.set('WEBGL_PACK', true);
ENV.set('WEBGL_PACK_BATCHNORMALIZATION', true); // this line is noop as already implied by WEBGL_PACK.

> running MobileNet v2 on an image

Is there demo code you're using to run it on an image?

@astojilj
Contributor Author

Ensure the tests assuming 'add' is unpacked run the unpacked op.

A minor update to the test to make it future-proof.

With or without this patch, there is still an issue with packed batchnorm on DeepLab v3+ MobileNetV2 segmentation. I'm trying to reproduce it in unit tests...

This one can be reviewed; there is an issue with the Karma setup causing CI to fail today.

Collaborator

@annxingyuan annxingyuan left a comment


Hey @astojilj thank you for this amazing contribution - just added some comments!

A few more things:

  • Thanks for pointing out the batchNorm issue. I'm not sure whether this is related, but #1434 fixed a bug with packed broadcasting. Would you mind testing DeepLab v3+ MobileNetV2 again against master to see whether it works now?
  • I like the way you set up the tests as it increases coverage, and it's probably how we should be testing packed operations moving forward. Let me check with my team though before approving an infrastructural change like this.
  • Could you run some benchmarks on WebGL 2 and 1 (ideally on an iPhone) with the packing flags turned on? With your PR there should be a substantial swath of MobileNet v2 that never leaves the packed path, and I'd love to know what kinds of speedups we might expect.

@annxingyuan
Collaborator

@astojilj

By the way, we have a standalone benchmarking page here: https://github.com/tensorflow/tfjs-core/blob/master/integration_tests/benchmarks/benchmark.html

@astojilj
Contributor Author

CI build failed

I need to update this. It is probably better if PR #1448 is reviewed first, as:

> I like the way you set up the tests as it increases coverage, and it's probably how we should be testing packed operations moving forward. Let me check with my team though before approving an infrastructural change like this.

I'd like to update this patch to the same approach there: activateWebGLPackedTestEnv.

> Could you run some benchmarks on WebGL 2 and 1 (ideally on an iPhone) with the packing flags turned on? With your PR there should be a substantial swath of MobileNet v2 that never leaves the packed path, and I'd love to know what kinds of speedups we might expect.

Using this and PR #1448, I plan to check how many expensive reshapes are happening, then test on an iPhone with this and a different packing scheme (Issue 949).
Note that we still have issues to solve in the packing implementation (higher GPU memory allocation than with all-unpacked ops; I noticed an expensive reshape on every frame when running on a const node filter in conv2d; etc.) before we can take the measured numbers as valid.

@annxingyuan
Collaborator

@astojilj

Do you have a hypothesis on the higher GPU memory allocation issue?

What do you mean you noticed an expensive reshape on every frame running on const node filter in conv2d? I'm not sure I understand.

@astojilj
Contributor Author

> Do you have a hypothesis on the higher GPU memory allocation issue?
>
> What do you mean you noticed an expensive reshape on every frame running on const node filter in conv2d? I'm not sure I understand.

Let me cover this offline.

@astojilj
Contributor Author

@annxingyuan I've set up the benchmarks for the all-packed-operations pipeline. Please check the results; there you'll also find a link to run them.

The code includes a non-minified version of tfjs-core if you'd like to debug. Additionally, you might want to always return true in isReshapeFree; that modification renders the output invalid, but it shows about a 5% improvement when using GL_FLOAT. Otherwise, the half-float pipeline is much faster when built with only packed ops.

On a related note, I opened Issue 1047.

Needless to say, if additional measurements are required, e.g. using batch normalization (I replaced all batchnorm nodes with bias-add operations using the graph transform tool) or separate measurements for packed arithmetic vs. batch<->space, just ask.

I will work on splitting #1448 as requested. How should we proceed with this PR?

@annxingyuan
Collaborator

@astojilj

Thank you so much for these benchmarks - you bring up a very interesting suggestion about enforcing half floats... I need to look at your results more closely, but let's first of all try to get this PR merged!

The only remaining issue is with testing. I like the way you set the tests up but I don't want to introduce this infrastructural change quite yet... would you mind writing a small set of packed arithmetic tests? Before we merge your other PR (#1448) I will try to add a global packed testing environment so we don't have to continue replicating tests.

You don't have to match the coverage on our existing arithmetic tests - could you just add 6 unit tests to arithmetic_test.ts for the following scenarios:

  • packed POW same shape
  • packed INT_DIV tensorlike chained
  • packed DIV broadcasting 5D + 2D
  • packed DIV same rank tensors different shape
  • packed POW gradient: tensor2D
  • packed INT_DIV gradient: tensor2d with broadcast

It doesn't matter to me which ops you test for the different scenarios - as long as each of your new ops gets hit.

I realize this means much less coverage, but it's only temporary!

@annxingyuan
Collaborator

@astojilj

Oh wait, I'm realizing that this does not support broadcasting, so maybe the unit test scenarios I listed are not the right ones to target. My point was just that you can pick a few scenarios to test!

@@ -52,7 +52,7 @@ export const POW = `
   if(a < 0.0 && floor(b) < b){
     return NAN;
   }
-  return (round(mod(b, 2.0)) == 0 || round(mod(b, 2.0)) == 2) ?
+  return (round(mod(b, 2.0)) != 1) ?
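The simplification is behavior-preserving: GLSL's mod(b, 2.0) always lies in [0, 2), so round(mod(b, 2.0)) can only be 0, 1, or 2, and testing != 1 is exactly the old == 0 || == 2 check for an even exponent. A TypeScript sketch of the equivalence (glslMod mirrors GLSL's mod semantics, which differ from JavaScript's % for negative operands):

```typescript
// GLSL mod(x, y) = x - y * floor(x / y): the result takes the sign of y,
// unlike JavaScript's remainder operator (%).
const glslMod = (x: number, y: number): number => x - y * Math.floor(x / y);

// Old shader condition: an even exponent keeps pow of a negative base positive.
const oldCheck = (b: number): boolean => {
  const m = Math.round(glslMod(b, 2.0));
  return m === 0 || m === 2;
};

// New shader condition from this diff: identical, since m is only 0, 1, or 2.
const newCheck = (b: number): boolean => Math.round(glslMod(b, 2.0)) !== 1;

for (let b = -6; b <= 6; b++) {
  if (oldCheck(b) !== newCheck(b)) throw new Error(`mismatch at b=${b}`);
}
console.log(newCheck(4), newCheck(3), newCheck(-3));  // true false false
```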
Contributor Author


@annxingyuan This change is outside the scope of the patch, as it affects non-packed ops. If needed, I can take it out.

@astojilj
Contributor Author

@astojilj

> Oh wait, I'm realizing that this does not support broadcasting, so maybe the unit test scenarios I listed are not the right ones to target. My point was just that you can pick a few scenarios to test!

@annxingyuan, since we'll get a packed environment in TEST_ENVS, I've opened Issue 1050 to cover it, copied and modified the original arithmetic_test.ts into arithmetic_packed_test.ts to set WEBGL_PACK_BINARY_OPERATIONS before running the tests, and added a TODO to remove arithmetic_packed_test.ts when the work on Issue 1050 is done.

@annxingyuan
Collaborator

@astojilj

Awesome - really glad to have this in the library - thank you for your patience with this PR! I will try to get tensorflow/tfjs#1050 done ASAP.

@annxingyuan annxingyuan merged commit 5501cd8 into tensorflow:master Dec 26, 2018