[HUDI-9316] Add Avro based ReaderContext to assist in migration to FileGroupReader #13171


Merged: 11 commits, Apr 22, 2025

Conversation

the-other-tim-brown (Contributor):

Change Logs

  • Moves the test reader context to a proper Avro-based reader context with its own unit testing
  • Fixes tests that were not closing resources properly

Impact

In order to migrate all reader paths to a consistent logical flow, we need to define readers that work with all the required engines. Currently, all engines can work with Avro IndexedRecords, since that is what the LogScanners output today. This common denominator will allow us to switch over the code and then focus on building optimal reader paths for each engine.
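
The "common denominator" idea can be sketched as an engine-agnostic conversion step. The sketch below is illustrative only: the names (`CommonRecord`, `readAsCommon`, `fromMap`) are hypothetical stand-ins, not the actual Hudi `HoodieReaderContext` API, and a plain `Map`-backed row stands in for an engine's native row type.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch only: the real Hudi HoodieReaderContext API differs.
// The idea: each engine supplies a converter from its native row type to a
// common record form (in Hudi's case, Avro IndexedRecord), so shared
// FileGroupReader logic can run unchanged across engines.
public class ReaderContextSketch {

  // Hypothetical common record form, standing in for Avro IndexedRecord.
  interface CommonRecord {
    Object get(int pos);
  }

  // Engine-agnostic read path: only the converter is engine-specific.
  static <E> List<CommonRecord> readAsCommon(List<E> engineRows,
                                             Function<E, CommonRecord> toCommon) {
    return engineRows.stream().map(toCommon).collect(Collectors.toList());
  }

  // A toy "engine row" backed by a Map, converted to the common form.
  static CommonRecord fromMap(Map<Integer, Object> row) {
    return row::get;
  }

  public static void main(String[] args) {
    List<Map<Integer, Object>> rows = List.of(Map.of(0, "key1", 1, 42));
    List<CommonRecord> common = readAsCommon(rows, ReaderContextSketch::fromMap);
    System.out.println(common.get(0).get(1)); // prints 42
  }
}
```

Once every engine's rows reach this common form, the merge and read logic only has to be written once; per-engine optimized paths can replace the conversion later, as the Impact section describes.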

Risk level (write none, low medium or high below)

None; this PR by itself only sets up a larger refactoring effort.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Apr 18, 2025
import java.util.List;
import java.util.Map;

public abstract class HoodieFileGroupReaderOnJavaTestBase<T> extends TestHoodieFileGroupReaderBase<T> {
the-other-tim-brown (Contributor, Author):

The contents of this class are moved from TestHoodieFileGroupReaderOnHive.

@@ -118,7 +116,6 @@ public static void setUp() throws IOException {
INSERT, DELETE, UPDATE, DELETE, UPDATE);
instantTimes = Arrays.asList(
"001", "002", "003", "004", "005");
shouldWritePositions = Arrays.asList(false, false, false, false, false);
the-other-tim-brown (Contributor, Author):

In this test and the others, this was previously a static variable that tests mutated, so execution was nondeterministic because test ordering is random. Moving it to an instance variable makes each test run deterministic.
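
The hazard described above can be shown in miniature. The class and method names below are hypothetical, not from the PR; the point is only that a static field mutated by one test leaks into the next, while an instance field is fresh per test object.

```java
// Minimal illustration of the ordering hazard described above: a static
// flag mutated by one "test" leaks into later ones, so with random test
// ordering the results become order-dependent. An instance field (as in
// the PR's fix) keeps each test independent.
public class StaticStateDemo {
  static boolean sharedFlag = false;  // shared across "tests": order-dependent
  boolean instanceFlag = false;       // fresh per instance: deterministic

  static boolean runWithSharedState() {
    boolean seen = sharedFlag;        // observes leftover state from prior calls
    sharedFlag = true;                // mutation leaks to whatever runs next
    return seen;
  }

  boolean runWithInstanceState() {
    boolean seen = instanceFlag;
    instanceFlag = true;              // mutation stays local to this instance
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(runWithSharedState());                        // false
    System.out.println(runWithSharedState());                        // true: depends on the prior call
    System.out.println(new StaticStateDemo().runWithInstanceState()); // false
    System.out.println(new StaticStateDemo().runWithInstanceState()); // false: independent
  }
}
```

Test frameworks that randomize method order (as JUnit can) turn the static version into a flaky test; the instance version gives the same answer regardless of what ran before it.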

Comment on lines +154 to +157
if (metaFieldsPopulated) {
return getFieldValueFromIndexedRecord(record, schema, RECORD_KEY_METADATA_FIELD).toString();
}
return keyGenerator.getRecordKey((GenericRecord) record);
Contributor:

nit: ideally this can be made general in HoodieReaderContext. OK to keep it as is in this PR.

}

@Override
public UnaryOperator<IndexedRecord> projectRecord(Schema from, Schema to, Map<String, String> renamedColumns) {
  if (!renamedColumns.isEmpty()) {
    throw new UnsupportedOperationException("Schema evolution is not supported for the HoodieAvroReaderContext");
  }
  Map<String, Integer> fromFields = IntStream.range(0, from.getFields().size())
Contributor:

If we still need #projectRecord, it would be good to cache the transformation keyed on <Schema from, Schema to, Map<String, String> renamedColumns> instead of computing the transformation for each record (BaseSparkInternalRowReaderContext does something similar).

Another optimization is to push down the projection to the reader itself, so the reader iterator directly returns the IndexedRecord based on the to schema if possible, avoiding reinstantiating the record here. This may require more investigation; we can keep functional correctness without worrying about performance in this PR for now.

the-other-tim-brown (Contributor, Author):

+1 to building the transform. There are other places we can do this as well that I have found while digging into this code path.
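
The caching the reviewer suggests could look roughly like the sketch below. This is illustrative only: schemas are represented as field-name lists for simplicity, and `ProjectionCache`, `Key`, and `projector` are hypothetical names, not the actual code in `BaseSparkInternalRowReaderContext`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Sketch of memoizing a projection per (from, to, renamedColumns) key so the
// transformation is built once and reused for every record, instead of being
// recomputed per record.
public class ProjectionCache {

  // Composite cache key: record equality covers all three components.
  record Key(List<String> from, List<String> to, Map<String, String> renames) {}

  private final ConcurrentHashMap<Key, UnaryOperator<Map<String, Object>>> cache =
      new ConcurrentHashMap<>();

  // Build the projection once per key; later calls with an equal key reuse it.
  public UnaryOperator<Map<String, Object>> projector(
      List<String> from, List<String> to, Map<String, String> renames) {
    return cache.computeIfAbsent(new Key(from, to, renames), k -> rec -> {
      Map<String, Object> out = new java.util.LinkedHashMap<>();
      for (String field : k.to()) {
        // A renamed target field reads from its old source name.
        String source = k.renames().getOrDefault(field, field);
        out.put(field, rec.get(source));
      }
      return out;
    });
  }

  public int size() {
    return cache.size();
  }
}
```

Keying on the schema pair amortizes the per-record field lookup; the second optimization mentioned above (pushing the projection into the reader) would remove even this per-record map rebuild.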

Comment on lines +115 to +116
// Create dedicated merger to avoid current delete logic holes.
// TODO: Unify delete logic (HUDI-7240).
Contributor:

@linliu-code is this fixed?

yihua (Contributor) left a comment:

LGTM. @the-other-tim-brown let's make sure the follow-ups are tracked or addressed in subsequent PRs.

@hudi-bot:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit fbd1ee8 into apache:master Apr 22, 2025
60 checks passed

3 participants