WebGL backend hands off kernel execution to CPU if inputs are small and on the CPU #1371
Conversation
Nice job on the benchmarking and getting this done so fast!
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan, @nsthorat, and @dsmilkov)
src/kernels/backend_webgl.ts, line 508 at r1 (raw file):
```ts
this.textureManager = new TextureManager(this.gpgpu);
this.cpuBackend = new MathBackendCPU();
```
instead of directly importing the CPUBackend, let's do

```ts
this.cpuBackend = ENV.findBackend('cpu');
```

Then in `shouldExecuteOnCPU` do:

```ts
if (this.cpuBackend == null) {
  return false;
}
```

This way you don't import from backend_cpu directly and the two backends remain decoupled.
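Put together, a minimal sketch of what's being suggested (import paths and the surrounding class context are assumed for illustration, not copied from the PR diff):

```ts
import {ENV} from '../environment';
import {KernelBackend} from './backend';
import {Tensor} from '../tensor';

export class MathBackendWebGL {
  private cpuBackend: KernelBackend|null;

  constructor() {
    // Resolve the CPU backend by name at runtime instead of importing
    // MathBackendCPU, so backend_webgl.ts and backend_cpu.ts stay decoupled.
    this.cpuBackend = ENV.findBackend('cpu');
  }

  private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
    if (this.cpuBackend == null) {
      // No CPU backend registered in this environment; never forward.
      return false;
    }
    // ...data-location and size checks follow (discussed below)...
    return true;
  }
}
```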
src/kernels/backend_webgl.ts, line 512 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
```
let's add a flag for this forwarding: `WEBGL_CPU_FORWARD`

and let's change `test-all.sh` to launch travis with this flag set to false, plus a single new travis run with this flag set to true, in order to hit the `if()` codepath. (we can chat in person)
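As a rough sketch of how the flag gate could read (assuming the flag is registered in the environment; registration details omitted):

```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10): boolean {
  // Gate the heuristic behind the proposed flag so one travis run can
  // disable forwarding entirely and another can exercise this codepath.
  if (!ENV.get('WEBGL_CPU_FORWARD')) {
    return false;
  }
  // Existing checks: every input must have no texture allocated (i.e. its
  // data still lives on the CPU) and be below the size threshold.
  return inputs.every(
      input => this.texData.get(input.dataId).texture == null &&
          util.sizeFromShape(input.shape) < sizeThreshold);
}
```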
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):
```

extract `sizeThreshold` into a constant (upper-case) at the top of the file.
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
```ts
private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):
```
Can we add a detailed comment here about the optimization please? And point to an issue about how we can optimize running specific ops on each backend as discussed over GVC?
Thanks!
Reviewable status: 0 of 1 approvals obtained (waiting on @annxingyuan and @nsthorat)
src/kernels/backend_webgl.ts, line 515 at r1 (raw file):
```ts
return inputs.every(
    input => this.texData.get(input.dataId).texture == null &&
        util.sizeFromShape(input.shape) < sizeThreshold);
```
use `input.size` instead of `sizeFromShape()` since we precompute it once during Tensor construction. Also, it would be great to have a follow-up PR that replaces `sizeFromShape(a.shape)` with `a.size` throughout our codebase.
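With that substitution the check might read (sketch):

```ts
// input.size is computed once in the Tensor constructor, so this avoids
// recomputing the product of the shape dimensions on every kernel call.
return inputs.every(
    input => this.texData.get(input.dataId).texture == null &&
        input.size < sizeThreshold);
```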
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger and @nsthorat)
src/kernels/backend_webgl.ts, line 508 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
instead of directly importing the CPUBackend, let's do `this.cpuBackend = ENV.findBackend('cpu');`. Then in `shouldExecuteOnCPU` do: `if (this.cpuBackend == null) { return false; }`. This way you don't import from backend_cpu directly and the two backends remain decoupled.
Done
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
Previously, nkreeger (Nick Kreeger) wrote…
`private shouldExecuteOnCPU(inputs: Tensor[], sizeThreshold = 10):`
Can we add a detailed comment here about the optimization please? And point to an issue about how we can optimize running specific ops on each backend as discussed over GVC?
Thanks!
Done
src/kernels/backend_webgl.ts, line 511 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
extract `sizeThreshold` into a constant (upper-case) at the top of the file.
Done
src/kernels/backend_webgl.ts, line 512 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
let's add a flag for this forwarding: `WEBGL_CPU_FORWARD`, and let's change `test-all.sh` to launch travis with this flag set to false, and a single new travis run with this flag set to true, in order to hit the `if()` codepath. (we can chat in person)
Done
src/kernels/backend_webgl.ts, line 515 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
use `input.size` instead of `sizeFromShape()` since we precompute it once during Tensor construction. Also, it would be great to have a follow-up PR that replaces `sizeFromShape(a.shape)` with `a.size` throughout our codebase.
Done
I still strongly believe this coupling between different backends should not be necessary. The logic should bubble up to at least the engine level. For example, depthToSpace calls the engine's runKernel method:

```ts
return ENV.engine.runKernel(
    backend => backend.depthToSpace($x, blockSize, dataFormat), {$x});
```

In the engine's runKernel code, runKernel only calls the current backend; it should iterate through the list of available backends. Instead of the WebGL backend automatically picking a backend, it should be able to give up execution of the op, and the engine should automatically forward it to the next available backend.
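A hypothetical sketch of that engine-level fallback (the `canHandle` hook and `registeredBackendNames` list are invented for illustration and do not exist in the current API):

```ts
// Inside Engine.runKernel: rather than always calling the current backend,
// iterate over the registered backends and let each one decline the kernel.
runKernel<T extends Tensor>(kernelFn: (backend: KernelBackend) => T): T {
  for (const name of this.registeredBackendNames) {
    const backend = ENV.findBackend(name);
    if (backend != null && backend.canHandle(kernelFn)) {  // hypothetical hook
      return kernelFn(backend);
    }
  }
  throw new Error('No registered backend could execute this kernel.');
}
```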
```diff
@@ -128,6 +129,7 @@ export interface TensorHandle {
   dtype: DataType;
 }

+const CPU_HANDOFF_SIZE_THRESHOLD = 10;
```
add explanation on this hard-coded number.
@pyu10055 I thought about this even more, and I actually believe we won't be able to move this out of the WebGL backend code, even if we do some smart backend allocation down the road (which I still think we should, though it's lower priority because there isn't a concrete use-case). The WebGL backend knows more than we can ever pass up to the engine, namely where the data is actually stored physically - the tensor data can live in the WebGL backend but sit on the CPU as a Float32Array or on the GPU as a WebGLTexture. The physical location is not visible outside the backend because that complexity is hidden by the WebGL backend. For example, we wouldn't want the engine to know about paging logic, which can move data between the CPU and the GPU within the WebGL backend, but the WebGL backend should be able to make decisions about where to run operations depending on how the paging logic has moved data around.
Another question I have about this change: why does it actually yield such a large improvement?

Heuristically, even if the inputs are small, running them on the CPU would mean downloading them from GPU to CPU, which automatically adds a ~5ms delay, unless those inputs are already on the CPU.

Blindly executing these ops on the CPU would probably be problematic for certain models, which makes me believe we should also check the input location before deciding to forward them to the CPU.
I agree with what you said, that WebGL has more detailed information about the location of the tensors, but that does not conflict with the idea of giving up execution of the op. The engine does not need to know the details; it just needs to push the execution to the next available backend. The coupling makes more sense to me if the CPU backend is a child of the WebGL backend and hidden from the engine. Since we now have syntax for running an op on a specific backend, this type of forwarding could be surprising if the user specifically wants to run the op on a particular backend.
Regarding how this gets improvements: we already check that the inputs are on the CPU before forwarding to the CPU backend, so we avoid downloads from the GPU - see inside `shouldExecuteOnCPU()`. We'll document that better in the comment above the method.

---

IIUC, your suggestion is for backends to dynamically let the engine know their preference for executing this particular op, with these particular inputs, at this particular time. This is basically what we want to do longer-term: a large refactor of engine/runKernel where we take two steps - ask ALL backends how they feel about this op with these inputs, then decide which one to finally call. It's a large refactor because we'd have to register kernels using strings instead of methods on an interface (which has implications for typings), change all calls to runKernel, etc. Again, something we should postpone until we get a better sense of what exactly that solution will look like with real-world use-cases.
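Concretely, that two-step dispatch might look something like this (entirely hypothetical sketch; the string-keyed kernel names, `registeredBackendNames` list, and `affinity` hook are invented, not existing API):

```ts
// Step 1: ask every registered backend how it feels about this kernel name
// with these inputs. Step 2: run it on the highest-scoring backend.
function pickBackend(kernelName: string, inputs: Tensor[]): KernelBackend|null {
  let best: KernelBackend|null = null;
  let bestScore = -Infinity;
  for (const name of ENV.registeredBackendNames) {   // hypothetical list
    const backend = ENV.findBackend(name);
    const score = backend.affinity(kernelName, inputs);  // hypothetical hook
    if (score > bestScore) {
      best = backend;
      bestScore = score;
    }
  }
  return best;
}
```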
Reviewed 5 of 5 files at r2.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, @annxingyuan, and @nsthorat)
src/kernels/backend_webgl.ts, line 525 at r2 (raw file):
This is a heuristic for determining when it would be faster to execute a kernel on the CPU. WebGL kernels opt into running this check and forwarding when appropriate.
Say a bit that we are only forwarding when all the inputs are on CPU.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, and @nsthorat)
src/kernels/backend_webgl.ts, line 132 at r2 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
add explanation on this hard-coded number.
Done
src/kernels/backend_webgl.ts, line 525 at r2 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Say a bit that we are only forwarding when all the inputs are on CPU.
Done
I think we can move forward with this change.
Reviewed 1 of 1 files at r3.
Reviewable status: 0 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, and @nsthorat)
scripts/test-all.sh, line 29 at r3 (raw file):
```sh
# already downloaded binary.
npm-run-all -p -c --aggregate-output \
  "test-travis --browsers=bs_safari_mac --features '{\"HAS_WEBGL\": false}' --backend cpu" \
```
fyi, you'll have some conflict since I renamed this file to test-travis.sh and changed it slightly in #1372
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @pyu10055, @annxingyuan, and @nsthorat)
src/environment.ts, line 309 at r4 (raw file):
So just to be 100% sure - we're defaulting this feature to `true`?
src/kernels/backend_webgl.ts, line 512 at r4 (raw file):
`private getCPUBackend()`

Should this method have a TypeScript return type?
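For example (a sketch of the annotated signature; the body is a guess at the lazy lookup, not the exact PR code):

```ts
private getCPUBackend(): KernelBackend|null {
  if (!ENV.get('WEBGL_CPU_FORWARD')) {
    return null;  // Forwarding disabled via the feature flag.
  }
  if (this.cpuBackend == null) {
    // Lazily resolve the registered CPU backend by name.
    this.cpuBackend = ENV.findBackend('cpu');
  }
  return this.cpuBackend;
}
```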
#1379 added a unit test that failed in travis due to numerical precision: https://travis-ci.org/tensorflow/tfjs-core/jobs/451654174#L801

This PR:

- fixes that test
- also fixes `test-travis.sh` to make sure we don't forward webgl --> cpu except for the last browser run -- probably caused during conflict resolving of #1371 and #1372 (https://reviewable.io/reviews/tensorflow/tfjs-core/1371#-LQ_a2y92YYLRMiYWqhC)

DEV
INTERNAL PR resolves tensorflow/tfjs#873 and responds to @dsmilkov's comment #1371 (review)
This PR addresses tensorflow/tfjs#826

Changes

- Adds a `shouldExecuteOnCPU` method to the `MathBackendWebGL` that checks whether inputs are on the CPU and within a size threshold.
- Adds checks to a set of ops (`stridedSlice`, `slice`, `concat`, `min`, `max`, `subtract`, `multiply`, `greater`, `less`) for passing kernel execution to the CPU if those conditions are met - see the sketch below. These are the ops that get called most frequently in COCO SSD with tiny inputs.

This PR is not productionized (no tests or environment flag) - I wasn't sure whether we wanted to just experiment with the approach first.
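As a rough illustration of the opt-in pattern described above (a sketch; the real method bodies in backend_webgl.ts differ, and `runWebGLMultiply` is a hypothetical stand-in for the GPU path):

```ts
multiply(a: Tensor, b: Tensor): Tensor {
  // If every input is tiny and already lives CPU-side within the WebGL
  // backend (no texture allocated yet), hand the kernel to the CPU backend
  // and skip the GPU round trip entirely.
  if (this.shouldExecuteOnCPU([a, b])) {
    return this.cpuBackend.multiply(a, b);
  }
  // Otherwise fall through to the usual GPGPU program path.
  return this.runWebGLMultiply(a, b);  // hypothetical stand-in
}
```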
COCO SSD benchmark

MacBook Pro

- on `master`: 277ms
- on `webgl_cpu_handoff`: 184ms (+50.5%)

Linux workstation

- on `master`: 372ms
- on `webgl_cpu_handoff`: 191ms (+95%)

iPhone 8 - iOS11

- on `master`: 1170ms
- on `webgl_cpu_handoff`: 976ms (+20%)

These are prediction times across 20 runs.
Note
I think there's a better way to write this so we don't have to repeat the checking logic in so many places. If we just create an array of kernel names we want to apply this treatment to, then we could loop through them and forward to the CPU using arguments. And if we could get a list of all the kernels, then we could apply the treatment across the whole WebGL backend. But it might require some TypeScript overrides...