WebGL backend hands off kernel execution to CPU if inputs are small and on the CPU #1371
Conversation
Nice job on the benchmarking and getting this done so fast!
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan, @nsthorat, and @dsmilkov)
src/kernels/backend_webgl.ts, line 508 at r1 (raw file):
```ts
this.textureManager = new TextureManager(this.gpgpu);
this.cpuBackend = new MathBackendCPU();
```
instead of directly importing the CPUBackend, let's do

```ts
this.cpuBackend = ENV.findBackend('cpu');
```

Then in `shouldExecuteOnCPU` do:

```ts
if (this.cpuBackend == null) {
  return false;
}
```

This way you don't import from backend_cpu directly and the two backends remain decoupled.
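Put together, a minimal sketch of what's being suggested (import paths and the surrounding class context are assumed for illustration, not copied from the PR diff):

```ts
import {ENV} from '../environment';
import {KernelBackend} from './backend';
import {Tensor} from '../tensor';

export class MathBackendWebGL {
  private cpuBackend: KernelBackend|null;

  constructor() {
    // Resolve the CPU backend by name at runtime instead of importing
    // MathBackendCPU, so backend_webgl.ts and backend_cpu.ts stay decoupled.
    this.cpuBackend = ENV.findBackend('cpu');
  }

  private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
    if (this.cpuBackend == null) {
      // No CPU backend registered in this environment; never forward.
      return false;
    }
    // ...data-location and size checks follow (discussed below)...
    return true;
  }
}
```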
src/kernels/backend_webgl.ts, line 512 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
```
let's add a flag for this forwarding: `WEBGL_CPU_FORWARD`

and let's change `test-all.sh` to launch travis with this flag set to false, plus a single new travis run with this flag set to true, in order to hit the `if()` codepath. (we can chat in person)
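As a rough sketch of how the flag gate could read (assuming the flag is registered in the environment; registration details omitted):

```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
  // Gate the heuristic behind the proposed flag so one travis run can
  // disable forwarding entirely and another can exercise this codepath.
  if (!ENV.get('WEBGL_CPU_FORWARD')) {
    return false;
  }
  // Existing checks: every input must have no texture allocated (i.e. its
  // data still lives on the CPU) and be below the size threshold.
  return inputs.every(
      input => this.texData.get(input.dataId).texture == null &&
          util.sizeFromShape(input.shape) < sizeThreshold);
}
```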
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):
```

extract `sizeThreshold` into a constant (upper-case) at the top of the file.
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):
```
Can we add a detailed comment here about the optimization please? And point to an issue about how we can optimize running specific ops on each backend as discussed over GVC?
Thanks!
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 515 at r1 (raw file):
```ts
return inputs.every(
    input => this.texData.get(input.dataId).texture == null &&
        util.sizeFromShape(input.shape) < sizeThreshold);
```
use `input.size` instead of `sizeFromShape()` since we precompute it once during Tensor construction. Also, it would be great to have a follow-up PR that replaces `sizeFromShape(a.shape)` with `a.size` throughout our codebase.
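With that substitution the check might read (sketch):

```ts
// input.size is computed once in the Tensor constructor, so this avoids
// recomputing the product of the shape dimensions on every kernel call.
return inputs.every(
    input => this.texData.get(input.dataId).texture == null &&
        input.size < sizeThreshold);
```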
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger and @nsthorat)
src/kernels/backend_webgl.ts, line 508 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
instead of directly importing the CPUBackend, let's do `this.cpuBackend = ENV.findBackend('cpu');`. Then in `shouldExecuteOnCPU` do: `if (this.cpuBackend == null) { return false; }`. This way you don't import from backend_cpu directly and the two backends remain decoupled.
Done
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
Previously, nkreeger (Nick Kreeger) wrote…
`private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):`
Can we add a detailed comment here about the optimization please? And point to an issue about how we can optimize running specific ops on each backend as discussed over GVC?
Thanks!
Done
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
extract `sizeThreshold` into a constant (upper-case) at the top of the file.
Done
src/kernels/backend_webgl.ts, line 512 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
let's add a flag for this forwarding: `WEBGL_CPU_FORWARD`, and let's change `test-all.sh` to launch travis with this flag set to false, and a single new travis run with this flag set to true, in order to hit the `if()` codepath. (we can chat in person)
Done
src/kernels/backend_webgl.ts, line 515 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
use `input.size` instead of `sizeFromShape()` since we precompute it once during Tensor construction. Also, it would be great to have a follow-up PR that replaces `sizeFromShape(a.shape)` with `a.size` throughout our codebase.
Done
I still strongly believe this coupling between different backends should not be necessary. The logic should bubble up to at least the engine level. For example, depthToSpace calls the engine's runKernel method:

```ts
return ENV.engine.runKernel(
    backend => backend.depthToSpace($x, blockSize, dataFormat), {$x});
```

In the engine's runKernel code, runKernel only calls the current backend; it should iterate through the list of available backends. Instead of the WebGL backend automatically picking a backend, it should be able to give up execution of the op, and the engine should automatically forward it to the next available backend.
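A hypothetical sketch of that engine-level fallback (the `canHandle` hook and `registeredBackendNames` list are invented for illustration and do not exist in the current API):

```ts
// Inside Engine.runKernel: rather than always calling the current backend,
// iterate over the registered backends and let each one decline the kernel.
runKernel<T extends Tensor>(kernelFn: (backend: KernelBackend) => T): T {
  for (const name of this.registeredBackendNames) {
    const backend = ENV.findBackend(name);
    if (backend != null && backend.canHandle(kernelFn)) {  // hypothetical hook
      return kernelFn(backend);
    }
  }
  throw new Error('No registered backend could execute this kernel.');
}
```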
```diff
@@ -128,6 +129,7 @@ export interface TensorHandle {
   dtype: DataType;
 }

+const CPU_HANDOFF_SIZE_THRESHOLD = 10;
```
add explanation on this hard-coded number.
@pyu10055 I thought about this even more, and I actually believe we won't be able to move this out of the WebGL backend code, even if we do some smart backend allocation down the road (which I still think we should, though it's lower priority because there isn't a concrete use-case). The WebGL backend knows more than we can ever pass up to the engine, namely where the data is actually stored physically - the tensor data can live in the WebGL backend but sit on the CPU as a Float32Array or on the GPU as a WebGLTexture. The physical location is not visible outside the backend because that complexity is hidden by the WebGL backend. For example, we wouldn't want the engine to know about paging logic, which can move data between the CPU and the GPU within the WebGL backend, but the WebGL backend should be able to make decisions about where to run operations depending on how the paging logic has moved data around.
Another question I have about this change: why does it actually yield such a large improvement?

Heuristically, even if the inputs are small, running them on the CPU would mean downloading them from GPU to CPU, which automatically adds a ~5ms delay, unless those inputs are already on the CPU.

Blindly executing these ops on the CPU would probably be problematic for certain models, which makes me believe we should also check the input location before deciding to forward them to the CPU.
I agree with what you said, that WebGL has more detailed information about the location of the tensors, but that does not conflict with the idea of giving up execution of the op. The engine does not need to know the details; it just needs to push the execution to the next available backend. The coupling makes more sense to me if the CPU backend is a child of the WebGL backend and hidden from the engine. Since we now have syntax for running an op on a specific backend, this type of forwarding could be surprising if the user specifically wants to run the op on a particular backend.
Regarding how this gets improvements: we already check that the inputs are on the CPU before forwarding to the CPU backend, so we avoid downloads from the GPU - see inside `shouldExecuteOnCPU()`. We'll document that better in the comment above the method.

---

IIUC, your suggestion is for backends to dynamically let the engine know their preference for executing this particular op, with these particular inputs, at this particular time. This is basically what we want to do longer-term: a large refactor of engine/runKernel where we take two steps - ask ALL backends how they feel about this op with these inputs, then decide which one to finally call. It's a large refactor because we'd have to register kernels using strings instead of methods on an interface (which has implications for typings), change all calls to runKernel, etc. Again, something we should postpone until we get a better sense of what exactly that solution will look like with real-world use-cases.
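Concretely, that two-step dispatch might look something like this (entirely hypothetical sketch; the string-keyed kernel names, `registeredBackendNames` list, and `affinity` hook are invented, not existing API):

```ts
// Step 1: ask every registered backend how it feels about this kernel name
// with these inputs. Step 2: run it on the highest-scoring backend.
function pickBackend(kernelName: string, inputs: Tensor[]): KernelBackend|null {
  let best: KernelBackend|null = null;
  let bestScore = -Infinity;
  for (const name of ENV.registeredBackendNames) {   // hypothetical list
    const backend = ENV.findBackend(name);
    const score = backend.affinity(kernelName, inputs);  // hypothetical hook
    if (score > bestScore) {
      best = backend;
      bestScore = score;
    }
  }
  return best;
}
```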
Reviewed 5 of 5 files at r2.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, @annxingyuan, and @nsthorat)
src/kernels/backend_webgl.ts, line 525 at r2 (raw file):
This is a heuristic for determining when it would be faster to execute a kernel on the CPU. WebGL kernels opt into running this check and forwarding when appropriate.
Say a bit that we are only forwarding when all the inputs are on CPU.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, and @nsthorat)
src/kernels/backend_webgl.ts, line 132 at r2 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
add explanation on this hard-coded number.
Done
src/kernels/backend_webgl.ts, line 525 at r2 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Say a bit that we are only forwarding when all the inputs are on CPU.
Done
I think we can move forward with this change.
Reviewed 1 of 1 files at r3.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, and @nsthorat)
scripts/test-all.sh, line 29 at r3 (raw file):
```sh
# already downloaded binary.
npm-run-all -p -c --aggregate-output \
  "test-travis --browsers=bs_safari_mac --features '{\"HAS_WEBGL\": false}' --backend cpu" \
```
fyi, you'll have some conflict since I renamed this file to test-travis.sh and changed it slightly in #1372
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, @annxingyuan, and @nsthorat)
src/environment.ts, line 309 at r4 (raw file):
So just to be 100% sure - we're defaulting this feature to `true`?
src/kernels/backend_webgl.ts, line 512 at r4 (raw file):
`private getCPUBackend()`

Should this method have a TypeScript return type?
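For example (a sketch of the annotated signature; the body is a guess at the lazy lookup, not the exact PR code):

```ts
private getCPUBackend(): KernelBackend|null {
  if (!ENV.get('WEBGL_CPU_FORWARD')) {
    return null;  // Forwarding disabled via the feature flag.
  }
  if (this.cpuBackend == null) {
    // Lazily resolve the registered CPU backend by name.
    this.cpuBackend = ENV.findBackend('cpu');
  }
  return this.cpuBackend;
}
```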
#1379 added a unit test that failed in travis due to numerical precision: https://travis-ci.org/tensorflow/tfjs-core/jobs/451654174#L801

This PR:

- fixes that test
- also fixes `test-travis.sh` to make sure we don't forward webgl --> cpu except for the last browser run -- probably caused during conflict resolving of #1371 and #1372 (https://reviewable.io/reviews/tensorflow/tfjs-core/1371#-LQ_a2y92YYLRMiYWqhC)

DEV
INTERNAL PR resolves tensorflow/tfjs#873 and responds to @dsmilkov's comment #1371 (review)
This PR addresses tensorflow/tfjs#826

Changes

- Adds a `shouldExecuteOnCPU` method to the `MathBackendWebGL` that checks whether inputs are on the CPU and within a size threshold.
- Adds checks to a set of ops (`stridedSlice`, `slice`, `concat`, `min`, `max`, `subtract`, `multiply`, `greater`, `less`) for passing kernel execution to the CPU if those conditions are met - see the sketch below. These are the ops that get called most frequently in COCO SSD with tiny inputs.

This PR is not productionized (no tests or environment flag) - I wasn't sure whether we wanted to just experiment with the approach first.
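As a rough illustration of the opt-in pattern described above (a sketch; the real method bodies in backend_webgl.ts differ, and `runWebGLMultiply` is a hypothetical stand-in for the GPU path):

```ts
multiply(a: Tensor, b: Tensor): Tensor {
  // If every input is tiny and already lives CPU-side within the WebGL
  // backend (no texture allocated yet), hand the kernel to the CPU backend
  // and skip the GPU round trip entirely.
  if (this.shouldExecuteOnCPU([a, b])) {
    return this.cpuBackend.multiply(a, b);
  }
  // Otherwise fall through to the usual GPGPU program path.
  return this.runWebGLMultiply(a, b);  // hypothetical stand-in
}
```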
COCO SSD benchmark

MacBook Pro

- on `master`: 277ms
- on `webgl_cpu_handoff`: 184ms (+50.5%)

Linux workstation

- on `master`: 372ms
- on `webgl_cpu_handoff`: 191ms (+95%)

iPhone 8 - iOS11

- on `master`: 1170ms
- on `webgl_cpu_handoff`: 976ms (+20%)

These are prediction times across 20 runs.
Note
I think there's a better way to write this so we don't have to repeat the checking logic in so many places. If we just create an array of kernel names we want to apply this treatment to, then we could loop through them and forward to the CPU using arguments. And if we could get a list of all the kernels, then we could apply the treatment across the whole WebGL backend. But it might require some TypeScript overrides...