Skip to content

[HUDI-8637] Refactor broadcast to a more generic reader context factory #13148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 17, 2025

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

General refactoring to move away from the spark specific broadcast nomenclature for ReaderContext to a serializable factory that serves basically the same purpose. Main changes:

  • Serializable factory for generating the ReaderContext
  • EngineContext will generate the factory since these readers will be engine specific
  • ReaderContext now contains the storage configuration that reads should use
  • Clean up use of closeable on existing code in the path

Impact

This is mainly cleanup and setup work for making the FileGroupReader the default throughout the codebase.

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Apr 15, 2025
CompactionOperation operation,
String instantTime,
Option<EngineBroadcastManager> broadcastManagerOpt) throws IOException {
HoodieWriteConfig writeConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's fix the indentation.


import java.io.Serializable;

public interface ReaderContextFactory<T> extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some doc to it.


@Override
public void close() {
this.schemaToVersionId.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems not releated with the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the cleanup. Generally there seems to be over use of closeable in some new code. This cleanup will make it easier to be consistent with enforcing that all closeables are closed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cleanup should be part of a follow up PR, then I can remove the changes related to this


private final transient HoodieEngineContext context;
private final transient HoodieTableMetaClient metaClient;
class SparkReaderContextFactory implements ReaderContextFactory<InternalRow> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add doc for this class.

@@ -319,9 +319,6 @@ public void close() throws IOException {
if (recordBuffer != null) {
recordBuffer.close();
}
if (readerContext != null) {
readerContext.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the problem here with the close?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader context doesn't have anything to close so there is no need for this line currently and no need to make the class implement closeable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the schema cache is small?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the schema cache does not actually have anything that requires closing. The cache object is only referenced from the context that creates it. Once that context is no longer referenced, the cache is eligible for GC.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 2bbb5fe into apache:master Apr 17, 2025
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants