feat:large-row-skip-in-bigtable | added experimental options to skip … #34245


Open · wants to merge 6 commits into base: master
Conversation

sarthakbhutani
Contributor

…large rows while reading from bigtable

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

try {
stream =
client
.skipLargeRowsCallable(new BigtableRowProtoAdapter())
Contributor

I forget, does this throw an exception with the large rows, or just silently swallow them?

Contributor Author

It swallows them and returns the next non-large row.
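The skip semantics described here (drop any row over the size limit, continue with the next one) can be illustrated with a self-contained sketch. The `Row` type and the byte threshold below are hypothetical stand-ins, not the actual Bigtable client types:

```java
import java.util.ArrayList;
import java.util.List;

public class SkipLargeRowsSketch {
    // Hypothetical stand-in for a Bigtable row: just a key and a payload size.
    record Row(String key, int sizeBytes) {}

    // Mimics the described behavior of skipLargeRowsCallable: rows over the
    // limit are silently swallowed and iteration continues with the next row;
    // no exception is surfaced to the caller.
    static List<Row> readSkippingLargeRows(List<Row> source, int maxRowBytes) {
        List<Row> result = new ArrayList<>();
        for (Row row : source) {
            if (row.sizeBytes() <= maxRowBytes) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows =
            List.of(new Row("a", 100), new Row("b", 5_000_000), new Row("c", 200));
        // "b" exceeds the limit and is skipped; "a" and "c" come through.
        System.out.println(readSkippingLargeRows(rows, 1_000_000));
    }
}
```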

@chamikaramj
Contributor

Retest this please

@sarthakbhutani
Contributor Author

Not sure why the spotless check is failing; it's related to code formatting. It works fine locally.

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @djyau for label bigtable.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

*
* <p>This is incompatible with withMaxBufferElementCount()
*/
public Read withExperimentalSkipLargeRows(@Nullable Boolean skipLargeRows) {
Contributor

Please don't encode feature status in the public ABI. That makes it impossible to evolve the feature into GA status. If this feature is not ready for GA, then please use ExperimentalOptions, which we already use for BIGTABLE_ENABLE_CLIENT_SIDE_METRICS.
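The suggested pattern — gating the feature on an experiment string carried in the pipeline options rather than a dedicated builder method — can be sketched without the Beam dependency. `hasExperiment` below is a simplified stand-in for Beam's `ExperimentalOptions.hasExperiment`, and the flag name `"bigtable_skip_large_rows"` is hypothetical:

```java
import java.util.List;

public class ExperimentGateSketch {
    // Simplified stand-in for Beam's ExperimentalOptions.hasExperiment:
    // experiments are plain strings attached to the pipeline options, so no
    // new public API surface is needed to toggle the feature.
    static boolean hasExperiment(List<String> experiments, String name) {
        return experiments != null && experiments.contains(name);
    }

    // Hypothetical experiment name for this feature.
    static final String BIGTABLE_SKIP_LARGE_ROWS = "bigtable_skip_large_rows";

    public static void main(String[] args) {
        List<String> experiments = List.of("use_runner_v2", BIGTABLE_SKIP_LARGE_ROWS);
        if (hasExperiment(experiments, BIGTABLE_SKIP_LARGE_ROWS)) {
            System.out.println("large-row skipping enabled");
        }
    }
}
```

Promoting the feature to GA later then only means removing the experiment check, with no binary-incompatible change to `Read`.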

Contributor Author

@justinuang - will we go with GA for skipLargeRows, or do we want to keep it experimental? If so, why?

Contributor

Let's use the experimental flag because it's not the final state we want to have. The example code Igor mentioned:

if (ExperimentalOptions.hasExperiment(pipelineOptions, BIGTABLE_ENABLE_CLIENT_SIDE_METRICS)) {

Comment on lines +180 to +191
if (bigtableReadOptions != null
    && Boolean.TRUE.equals(bigtableReadOptions.getExperimentalSkipLargeRows())) {
  stream =
      client
          .skipLargeRowsCallable(new BigtableRowProtoAdapter())
          .call(query, GrpcCallContext.createDefault());
} else {
  stream =
      client
          .readRowsCallable(new BigtableRowProtoAdapter())
          .call(query, GrpcCallContext.createDefault());
}
Contributor

Resolving of the feature should be done during pipeline construction not during execution

Also, you should factor out the common code:

readRowsCallable = client.readRowsCallable(new BigtableRowProtoAdapter());
if (isLargeRowSkippingEnabled) {
  readRowsCallable = client.skipLargeRowsCallable(new BigtableRowProtoAdapter());
}
readRowsCallable.call(...

Contributor Author

  1. Regarding the code suggestion - instead of having the if-else, do we want to override the code later?

  2. "Resolving of the feature should be done during pipeline construction not during execution"
    Initially, I had made this into a separate reader class. Since this was an experimental option, it came out in the discussion that we don't want the overhead of maintaining a separate reader for new changes.

And I had to revert those changes - 50f7924

Contributor

I think you can pass in a unary callable when creating the BigtableReaderImpl instead of the client:

public Reader createReader(BigtableSource source) throws IOException {
  if (source.getMaxBufferElementCount() != null) {
    return BigtableSegmentReaderImpl.create(
        client,
        projectId,
        instanceId,
        source.getTableId().get(),
        source.getRanges(),
        source.getRowFilter(),
        source.getMaxBufferElementCount());
  } else {
    return new BigtableReaderImpl(
        client,
        projectId,
        instanceId,
        source.getTableId().get(),
        source.getRanges(),
        source.getRowFilter());
  }
}

Contributor

No, I meant keep the unary callable where it is, but resolve the unset value in the settings, i.e. this logic:

bigtableReadOptions != null
    && Boolean.TRUE.equals(bigtableReadOptions.getExperimentalSkipLargeRows())

What "unset" means should be resolved during pipeline construction.
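Resolving the nullable flag once at construction time can look like the following sketch: the tri-state `Boolean` (null / FALSE / TRUE) is collapsed to a concrete `boolean` before the pipeline runs, so execution-time code never re-checks for null. The method name here is illustrative, not the PR's actual API:

```java
public class ResolveFlagSketch {
    // Collapse a tri-state Boolean to a concrete boolean at
    // pipeline-construction time; an unset (null) flag means "feature off".
    static boolean resolveSkipLargeRows(Boolean experimentalSkipLargeRows) {
        return Boolean.TRUE.equals(experimentalSkipLargeRows);
    }

    public static void main(String[] args) {
        System.out.println(resolveSkipLargeRows(null));          // false
        System.out.println(resolveSkipLargeRows(Boolean.TRUE));  // true
        System.out.println(resolveSkipLargeRows(Boolean.FALSE)); // false
    }
}
```

With the default baked in during construction, the execution path can branch on a plain `boolean` with no null handling.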

new BigtableOptions.Builder().setProjectId(project).setInstanceId(options.getInstanceId());

final String tableId = "BigtableReadTest";
final long numRows = 1000L;
Contributor

how does this test large row skipping?

Contributor Author

This doesn't.
The main logic lives in the java-client; the Apache Beam implementation is only a wrapper that calls it. I couldn't figure out how to test the large-row skipping in an IT here - it's already tested in the java-client.

It came out in our earlier discussion that we need a data-integrity check ensuring no data loss.
Hence, this is a data-integrity check: when there isn't a large row, the feature still works as expected, reading all the rows.
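The data-integrity check described above boils down to a count comparison: write N rows, read them back with skipping enabled, and assert that nothing was dropped. A minimal, Bigtable-free sketch of that assertion (the reader here is a hypothetical stand-in for the skip-enabled read path):

```java
import java.util.ArrayList;
import java.util.List;

public class DataIntegritySketch {
    // Hypothetical stand-in for a skip-enabled read: when no row exceeds the
    // size limit, it must return every row that was written.
    static List<String> readAllRows(List<String> written) {
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            written.add("key" + i);
        }
        List<String> read = readAllRows(written);
        // Integrity assertion: no data loss when no large rows are present.
        if (read.size() != written.size()) {
            throw new AssertionError("row count mismatch: " + read.size());
        }
        System.out.println("read " + read.size() + " rows, no loss");
    }
}
```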

Contributor

I think you can do a bit better here

testE2EBigtableReadWithSkippingLargeRows() {
  // ...
  // add an error injector to trigger large row logic
  ExperimentalOptions.addExperiment("bigtable_settings_override", InterceptorInjector.class.getName());
  // ...
}

static class LargeRowErrorInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
        next.newCall(method, callOptions)) {
      private boolean artificiallyClosed = false;
      private int numMsgs = 0;

      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        super.start(
            new ForwardingClientCallListener.SimpleForwardingClientCallListener<RespT>(
                responseListener) {
              @Override
              public void onMessage(RespT message) {
                if (++numMsgs > 10) {
                  artificiallyClosed = true;
                  delegate().onClose(Status.WHATEVER_ERROR_TRIGGERS_PAGING, new Metadata());
                  return;
                }
                super.onMessage(message);
              }

              @Override
              public void onClose(Status status, Metadata trailers) {
                if (!artificiallyClosed) {
                  super.onClose(status, trailers);
                }
              }
            },
            headers);
      }
    };
  }
}

public static class InterceptorInjector
    implements BiFunction<BigtableDataSettings.Builder, PipelineOptions, BigtableDataSettings.Builder> {
  @Override
  public BigtableDataSettings.Builder apply(
      BigtableDataSettings.Builder builder, PipelineOptions pipelineOptions) {
    InstantiatingGrpcChannelProvider.Builder transportChannelProvider =
        ((InstantiatingGrpcChannelProvider)
                builder.stubSettings().getTransportChannelProvider())
            .toBuilder();
    ApiFunction<ManagedChannelBuilder, ManagedChannelBuilder> oldConf =
        transportChannelProvider.getChannelConfigurator();
    transportChannelProvider.setChannelConfigurator(
        b -> {
          if (oldConf != null) {
            b = oldConf.apply(b);
          }
          return b.intercept(new LargeRowErrorInterceptor());
        });
    // Install the modified channel provider and return the updated settings.
    builder.stubSettings().setTransportChannelProvider(transportChannelProvider.build());
    return builder;
  }
}

5 participants