This repository was archived by the owner on Aug 15, 2019. It is now read-only.

Add packed arithmetic binary operation kernel. (#951) #1423

Merged
merged 13 commits into from
Dec 26, 2018

Conversation

astojilj
Contributor

@astojilj astojilj commented Dec 3, 2018

FEATURE

Description

  • Doesn't support 5-D and 6-D tensors.
  • Supports add (addN), sub, multiply, realDivide, floorDiv, and pow.
  • Broadcasting is supported only when one of the operands is a scalar or 1-D.
  • Added behind the environment flag 'WEBGL_PACK_BINARY_OPERATIONS'.
  • Existing arithmetic tests are configured to run with both non-packed and packed binary operations.
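To illustrate what the flag enables (a CPU sketch only, not the actual WebGL kernel; all names here are illustrative): packed mode stores four adjacent tensor values in one RGBA texel, so an elementwise binary op runs once per texel instead of once per value.

```typescript
// Sketch: emulate a packed elementwise binary op on the CPU.
// A number[4] tuple stands in for one RGBA texel (a vec4 in the shader).
type Vec4 = [number, number, number, number];

// Pack a flat array into vec4 "texels", zero-padding the tail.
function pack(data: number[]): Vec4[] {
  const texels: Vec4[] = [];
  for (let i = 0; i < data.length; i += 4) {
    const at = (j: number) => (j < data.length ? data[j] : 0);
    texels.push([at(i), at(i + 1), at(i + 2), at(i + 3)]);
  }
  return texels;
}

// One "shader invocation" per texel: the op is applied to all four lanes.
function packedBinaryOp(
    a: Vec4[], b: Vec4[], op: (x: number, y: number) => number): Vec4[] {
  return a.map((texel, i) => texel.map((x, lane) => op(x, b[i][lane])) as Vec4);
}

const sum = packedBinaryOp(
    pack([1, 2, 3, 4, 5]), pack([10, 20, 30, 40, 50]), (x, y) => x + y);
console.log(sum);  // [[11, 22, 33, 44], [55, 0, 0, 0]]
```

A 5-element tensor takes two texels instead of five scalar texture reads, which is where the speedup comes from.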

For repository owners only:

Please remember to apply all applicable tags to your pull request.
Tags: FEATURE, BREAKING, BUG, PERF, DEV, DOC, SECURITY

For more info see: https://github.com/tensorflow/tfjs/blob/master/DEVELOPMENT.md



@annxingyuan
Collaborator

Hey @astojilj - thank you so much for this contribution! Let me know when this is ready for review - currently the tests are failing (related to the new TEST_ENV). FYI we've been testing packed ops by adding describeWithFlags blocks and flipping the relevant flags on at the beginning and off after the tests have run. Here's an example: https://github.com/tensorflow/tfjs-core/blob/master/src/ops/batchnorm_test.ts#L22
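The flag-flipping pattern described above can be sketched without tfjs-core (the Env class below is a toy stand-in for tfjs-core's ENV registry, not its real API):

```typescript
// Toy stand-in for tfjs-core's ENV flag registry (illustrative only).
class Env {
  private flags = new Map<string, boolean>();
  get(name: string): boolean { return this.flags.get(name) === true; }
  set(name: string, value: boolean): void { this.flags.set(name, value); }
}
const ENV = new Env();

// The describeWithFlags idea: turn a flag on before a suite runs and
// restore the previous value afterwards, even if the suite throws, so
// every existing test can also exercise the packed code path.
function withFlag<T>(name: string, value: boolean, suite: () => T): T {
  const previous = ENV.get(name);
  ENV.set(name, value);
  try {
    return suite();
  } finally {
    ENV.set(name, previous);
  }
}

const packedDuring = withFlag(
    'WEBGL_PACK_BINARY_OPERATIONS', true,
    () => ENV.get('WEBGL_PACK_BINARY_OPERATIONS'));
console.log(packedDuring, ENV.get('WEBGL_PACK_BINARY_OPERATIONS'));
// true false
```

The try/finally matters: a failing test must not leak the packed flag into later suites.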

astojilj added a commit to astojilj/tfjs-core that referenced this pull request Dec 4, 2018
This is a fix to PR tensorflow#1423 issue - webgl testing environment
is not runnable from yarn node-test.
@astojilj
Contributor Author

astojilj commented Dec 4, 2018

@annxingyuan, PTAL.

> FYI we've been testing packed ops by adding describeWithFlags blocks and flipping the relevant flags on at the beginning and off after the tests have run. Here's an example: https://github.com/tensorflow/tfjs-core/blob/master/src/ops/batchnorm_test.ts#L22

Thanks, I learned from those. In this case, I wanted to run the existing tests with both non-packed and packed operations rather than write new tests for the packed arithmetic ops. The code doing that is in 'arithmetic_test.ts'.

@annxingyuan
Collaborator

Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off? I tried dropping your change into an internal model and the outputs seem to be different.

Thank you!

astojilj added a commit to astojilj/tfjs-core that referenced this pull request Dec 5, 2018
This is a fix to PR tensorflow#1423 issue - webgl testing environment
is not runnable from yarn node-test.
@astojilj
Contributor Author

astojilj commented Dec 5, 2018

> Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off? I tried dropping your change into an internal model and the outputs seem to be different.
>
> Thank you!

After rebasing to the latest master, the tests also started to fail. I fixed that (updated packed sampling to the same approach you used with batchnorm and clip; thanks, it's much simpler now) and the tests pass locally.

@astojilj
Contributor Author

astojilj commented Dec 5, 2018

@annxingyuan PTAL. This should now be ready for review, and it has been verified with the DeepLab v3+ MobileNetV2 model.

> Hey @astojilj, before I review this fully, would you mind running MobileNet v2 on an image and confirming that the logits are identical with your WEBGL_PACKED_BINARYOP flag on and off?

I will move the following content to a bug report once I find a way to reproduce it in tests:
When I run DeepLab v3+ MobileNetV2 (the model and inference code are explained here, and I could set up a repo with the code and model if needed) with the master branch (this patch not applied), inference doesn't work when packed batchnorm is enabled:

Works fine with this in code:

ENV.set('WEBGL_PACK', true);
ENV.set('WEBGL_PACK_BATCHNORMALIZATION', false);

Doesn't work (no errors in console but segmentation doesn't work):

ENV.set('WEBGL_PACK', true);
ENV.set('WEBGL_PACK_BATCHNORMALIZATION', true); // this line is noop as already implied by WEBGL_PACK.

> running MobileNet v2 on an image

Is there demo code you're using to run it on an image?

@astojilj
Contributor Author

Ensure the tests assuming 'add' is unpacked run the unpacked op.

A minor update to the test to make it future-proof.

With or without this patch, there is still an issue with packed batchnorm on DeepLab v3+ MobileNetV2 segmentation. I'm trying to reproduce it in unit tests...

This one can be reviewed; there is an issue with the Karma setup causing CI to fail today.

Collaborator

@annxingyuan annxingyuan left a comment


Hey @astojilj thank you for this amazing contribution - just added some comments!

A few more things:

  • Thanks for pointing out the batchNorm issue. I'm not sure whether this is related, but #1434 fixed a bug with packed broadcasting. Would you mind testing DeepLab v3+ MobileNetV2 again against master to see whether it works now?
  • I like the way you set up the tests as it increases coverage, and it's probably how we should be testing packed operations moving forward. Let me check with my team though before approving an infrastructural change like this.
  • Could you run some benchmarks on WebGL 2 and 1 (ideally on an iPhone) with the packing flags turned on? With your PR there should be a substantial swath of MobileNet v2 that never leaves the packed path, and I'd love to know what kinds of speedups we might expect.

@annxingyuan
Collaborator

@astojilj

By the way, we have a standalone benchmarking page here: https://github.com/tensorflow/tfjs-core/blob/master/integration_tests/benchmarks/benchmark.html

@astojilj
Contributor Author

CI build failed

I need to update this. It is probably better if PR #1448 is reviewed first, as:

> I like the way you set up the tests as it increases coverage, and it's probably how we should be testing packed operations moving forward. Let me check with my team though before approving an infrastructural change like this.

I'd like to update this patch to the same approach there: activateWebGLPackedTestEnv.

> Could you run some benchmarks on WebGL 2 and 1 (ideally on an iPhone) with the packing flags turned on? With your PR there should be a substantial swath of MobileNet v2 that never leaves the packed path, and I'd love to know what kinds of speedups we might expect.

Using this and PR #1448, I plan to check how many expensive reshapes are happening, then test on an iPhone with this and a different packing scheme (Issue 949).
Note that we still have issues to solve in the packing implementation (higher GPU memory allocation than with all-unpacked ops; I noticed an expensive reshape on every frame when running on a const node filter in conv2d; etc.) before we can take the measured numbers as valid.

@annxingyuan
Collaborator

@astojilj

Do you have a hypothesis on the higher GPU memory allocation issue?

What do you mean you noticed an expensive reshape on every frame running on const node filter in conv2d? I'm not sure I understand.

@astojilj
Contributor Author

> Do you have a hypothesis on the higher GPU memory allocation issue?
>
> What do you mean you noticed an expensive reshape on every frame running on const node filter in conv2d? I'm not sure I understand.

Let me cover this offline.

@astojilj
Contributor Author

@annxingyuan I've set up the benchmarks for the all-packed-operations pipeline. Please check the results; there you'll also find a link to run them.

The code includes a non-minified version of tfjs-core if you'd like to debug. Additionally, you might want to always return true in isReshapeFree; that modification renders the output invalid, but it shows about a 5% improvement when using GL_FLOAT. Otherwise, the half-float pipeline is much faster when built with only packed ops.

On a related note, I opened Issue 1047.

Needless to say, if additional measurements are required, e.g. using batch normalization (I replaced all batchnorm nodes with bias-add operations using the graph transform tool) or separate measurements for packed arithmetic vs. batch<->space, just ask.

I will work on splitting #1448 as requested. How should we proceed with this PR?

@annxingyuan
Collaborator

@astojilj

Thank you so much for these benchmarks - you bring up a very interesting suggestion about enforcing half floats... I need to look at your results more closely, but let's first of all try to get this PR merged!

The only remaining issue is with testing. I like the way you set the tests up but I don't want to introduce this infrastructural change quite yet... would you mind writing a small set of packed arithmetic tests? Before we merge your other PR (#1448) I will try to add a global packed testing environment so we don't have to continue replicating tests.

You don't have to match the coverage on our existing arithmetic tests - could you just add 6 unit tests to arithmetic_test.ts for the following scenarios:

  • packed POW same shape
  • packed INT_DIV tensorlike chained
  • packed DIV broadcasting 5D + 2D
  • packed DIV same rank tensors different shape
  • packed POW gradient: tensor2D
  • packed INT_DIV gradient: tensor2d with broadcast

It doesn't matter to me which ops you test for the different scenarios - as long as each of your new ops gets hit.

I realize this means much less coverage, but it's only temporary!

@annxingyuan
Collaborator

@astojilj

Oh wait, I'm realizing that this does not support broadcasting, so maybe the unit test scenarios I listed are not the right ones to target. My point was just that you can pick a few scenarios to test!

@@ -52,7 +52,7 @@ export const POW = `
   if(a < 0.0 && floor(b) < b){
     return NAN;
   }
-  return (round(mod(b, 2.0)) == 0 || round(mod(b, 2.0)) == 2) ?
+  return (round(mod(b, 2.0)) != 1) ?
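The simplification is behavior-preserving: GLSL's mod(b, 2.0) always lies in [0, 2), so round(mod(b, 2.0)) can only be 0, 1, or 2, and testing != 1 is exactly the old == 0 || == 2 check for an even exponent. A TypeScript sketch of the equivalence (glslMod mirrors GLSL's mod semantics, which differ from JavaScript's % for negative operands):

```typescript
// GLSL mod(x, y) = x - y * floor(x / y): the result takes the sign of y,
// unlike JavaScript's remainder operator (%).
const glslMod = (x: number, y: number): number => x - y * Math.floor(x / y);

// Old shader condition: an even exponent keeps pow of a negative base positive.
const oldCheck = (b: number): boolean => {
  const m = Math.round(glslMod(b, 2.0));
  return m === 0 || m === 2;
};

// New shader condition from this diff: identical, since m is only 0, 1, or 2.
const newCheck = (b: number): boolean => Math.round(glslMod(b, 2.0)) !== 1;

for (let b = -6; b <= 6; b++) {
  if (oldCheck(b) !== newCheck(b)) throw new Error(`mismatch at b=${b}`);
}
console.log(newCheck(4), newCheck(3), newCheck(-3));  // true false false
```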
Contributor Author


@annxingyuan This change is outside the scope of the patch, as it affects non-packed ops. If needed, I can take it out.

@astojilj
Contributor Author

@astojilj

> Oh wait, I'm realizing that this does not support broadcasting, so maybe the unit test scenarios I listed are not the right ones to target. My point was just that you can pick a few scenarios to test!

@annxingyuan, since we'll get a packed environment in TEST_ENVS, I've opened Issue 1050 to cover it, copied and modified the original arithmetic_test.ts into arithmetic_packed_test.ts to set WEBGL_PACK_BINARY_OPERATIONS before running the tests, and added a TODO to remove arithmetic_packed_test.ts when the work on Issue 1050 is done.

@annxingyuan
Collaborator

@astojilj

Awesome - really glad to have this in the library - thank you for your patience with this PR! I will try to get tensorflow/tfjs#1050 done ASAP.

@annxingyuan annxingyuan merged commit 5501cd8 into tensorflow:master Dec 26, 2018