Remove caching for remainders of state backed iterables. #34718

Closed
wants to merge 1 commit into from

robertwb
Contributor

State-backed iterables are used specifically when the full iterable has already been deemed too large to fit into memory, which is the antithesis of the kind of thing we want to cache. In particular, iterating over an iterable too large to fit entirely into the cache evicts everything else in the cache as a side effect.
Also, unlike standard state values (including side inputs), these iterables are much less likely to be iterated over more than once, which reduces the value of caching them.

Using a cache here is particularly problematic when writing Avro (or Parquet) files with fixed sharding: these sinks already do significant buffering of their own, and some of their common inputs (like org.apache.avro.util.Utf8 objects) can double or triple in held memory when certain methods are accessed. The cache sizing algorithms do not reflect this growth, because they assume an object's retained memory cannot change notably once it is placed in the cache.
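
To illustrate the eviction side effect described above, here is a minimal sketch using a plain entry-count LRU cache built on `java.util.LinkedHashMap`. It is not Beam's actual cache (which is weighted by estimated object size); the class `CacheEvictionSketch`, the method `hotEntriesSurviving`, and the key names are all hypothetical and exist only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheEvictionSketch {

  /** Minimal LRU cache bounded by entry count; a stand-in, not Beam's weighted cache. */
  static final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
      super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
      this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
      return size() > maxEntries;
    }
  }

  /**
   * Fills the cache with "hot" entries (standing in for side inputs or other state),
   * then streams the pages of one large iterable past it. Returns how many hot
   * entries are still cached afterwards.
   */
  static int hotEntriesSurviving(
      int cacheCapacity, int hotEntries, int pages, boolean cachePages) {
    LruCache<String, String> cache = new LruCache<>(cacheCapacity);
    for (int i = 0; i < hotEntries; i++) {
      cache.put("side-input-" + i, "value");
    }
    for (int p = 0; p < pages; p++) {
      if (cachePages) {
        // Each page is read once and never again, yet it displaces older entries.
        cache.put("iterable-page-" + p, "page");
      }
    }
    int surviving = 0;
    for (int i = 0; i < hotEntries; i++) {
      // containsKey does not touch the access order, so counting is side-effect free.
      if (cache.containsKey("side-input-" + i)) {
        surviving++;
      }
    }
    return surviving;
  }

  public static void main(String[] args) {
    System.out.println(hotEntriesSurviving(100, 50, 1000, true));  // pages cached
    System.out.println(hotEntriesSurviving(100, 50, 1000, false)); // pages bypass the cache
  }
}
```

With the numbers in `main`, caching the pages leaves none of the 50 hot entries in the cache, while bypassing the cache leaves all 50 in place. The pages themselves are never looked up a second time, so caching them buys nothing in this scenario.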


@github-actions github-actions bot added the java label Apr 23, 2025
@robertwb
Contributor Author

R: @priyansndesai

Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@robertwb robertwb force-pushed the state-backed-gbk-caching branch from d2eb0ed to 46db19b Compare April 23, 2025 19:20
@robertwb
Copy link
Contributor Author

Rendered obsolete by #34746.

@robertwb robertwb closed this Apr 25, 2025