[HUDI-9316] Add Avro based ReaderContext to assist in migration to FileGroupReader #13171
Conversation
import java.util.List;
import java.util.Map;

public abstract class HoodieFileGroupReaderOnJavaTestBase<T> extends TestHoodieFileGroupReaderBase<T> {
The contents of this class were moved from TestHoodieFileGroupReaderOnHive.
@@ -118,7 +116,6 @@ public static void setUp() throws IOException {
        INSERT, DELETE, UPDATE, DELETE, UPDATE);
    instantTimes = Arrays.asList(
        "001", "002", "003", "004", "005");
    shouldWritePositions = Arrays.asList(false, false, false, false, false);
In this test and the others, this was previously a static variable that individual tests mutated, so state leaked between tests and, since test ordering is random, execution was non-deterministic. Moving it to an instance variable makes each test run deterministic.
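The fix can be sketched as follows. This is an illustrative mock-up, not the actual test class: an instance field is re-initialized for every test instance, so one test's mutation can no longer leak into another.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: per-test instance state instead of shared static state.
class WritePositionsTestBase {
  // Before: a `static List<Boolean> shouldWritePositions` mutated by tests
  // was shared by all tests, so results depended on execution order.
  // After: an instance field, freshly initialized per test instance.
  protected List<Boolean> shouldWritePositions =
      Arrays.asList(false, false, false, false, false);
}
```

With JUnit's default per-method test instance lifecycle, each test method gets its own instance and therefore its own copy of the field.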
if (metaFieldsPopulated) {
  return getFieldValueFromIndexedRecord(record, schema, RECORD_KEY_METADATA_FIELD).toString();
}
return keyGenerator.getRecordKey((GenericRecord) record);
nit: ideally this can be made general in HoodieReaderContext. OK to keep it as is in this PR.
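The reviewer's suggestion could look roughly like the sketch below: the shared key-lookup logic moves into the engine-agnostic base class, and each engine supplies only field access and key generation. All class, method, and field names here are hypothetical, not the actual HoodieReaderContext API; `_hoodie_record_key` is the Hudi record-key meta field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of hoisting the record-key lookup into a shared base
// class; names are illustrative, not the real HoodieReaderContext API.
abstract class ReaderContextSketch<T> {
  protected boolean metaFieldsPopulated;

  // Engine-specific: read a named field from an engine record.
  protected abstract Object getFieldValue(T record, String fieldName);

  // Engine-specific: derive the key when meta fields are not populated.
  protected abstract String generateKey(T record);

  // Shared across engines, mirroring the Avro snippet above.
  public String getRecordKey(T record) {
    if (metaFieldsPopulated) {
      return getFieldValue(record, "_hoodie_record_key").toString();
    }
    return generateKey(record);
  }
}
```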
}

@Override
public UnaryOperator<IndexedRecord> projectRecord(Schema from, Schema to, Map<String, String> renamedColumns) {
  if (!renamedColumns.isEmpty()) {
    throw new UnsupportedOperationException("Schema evolution is not supported for the HoodieAvroReaderContext");
  }
  Map<String, Integer> fromFields = IntStream.range(0, from.getFields().size())
If we still need #projectRecord, it would be good to cache the transformation keyed on <Schema from, Schema to, Map<String, String> renamedColumns> instead of computing it for each record (BaseSparkInternalRowReaderContext does something similar).
Another optimization is to push the projection down to the reader itself, so the reader iterator directly returns the IndexedRecord based on the to schema where possible, avoiding reinstantiating the record here. This may require more investigation; we can keep the functional correctness without worrying about performance in this PR for now.
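The caching idea can be sketched as below. To stay self-contained, lists of field names stand in for Avro Schemas, and the class and method names are hypothetical; the point is only that the field-position mapping is computed once per schema pair and the resulting operator is reused for every record, as the review comment suggests.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Sketch: memoize the projection operator per (from, to) schema pair so the
// expensive mapping is built once, not once per record. Names are hypothetical.
class ProjectionCache {
  private final Map<Map.Entry<List<String>, List<String>>, UnaryOperator<Object[]>> cache =
      new ConcurrentHashMap<>();

  UnaryOperator<Object[]> projectRecord(List<String> from, List<String> to) {
    return cache.computeIfAbsent(new SimpleImmutableEntry<>(from, to), key -> {
      // Done once per schema pair: position of each target field in the source.
      int[] positions = to.stream().mapToInt(from::indexOf).toArray();
      return record -> {
        Object[] projected = new Object[positions.length];
        for (int i = 0; i < positions.length; i++) {
          projected[i] = record[positions[i]];
        }
        return projected;
      };
    });
  }
}
```

A real version would key on the Avro Schema objects (plus the renamed-columns map) rather than field-name lists, relying on Schema's equals/hashCode for the cache lookup.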
+1 to building the transform. There are other places we can do this as well that I have found while digging into this code path.
// Create dedicated merger to avoid current delete logic holes.
// TODO: Unify delete logic (HUDI-7240).
@linliu-code is this fixed?
…rContext.java Co-authored-by: Y Ethan Guo <[email protected]>
LGTM. @the-other-tim-brown let's make sure the follow-ups are tracked or addressed in subsequent PRs.
Change Logs
Impact
In order to migrate all reader paths to a consistent logical flow, we need to define readers that work with all the required engines. Currently the engines will all work with the Avro IndexedRecords since that is what the LogScanners will output today. This is the common denominator that will allow us to switch over the code and then focus on building optimal reader paths for each engine.
Risk level (write none, low medium or high below)
None. This PR in itself just sets up a larger set of refactoring.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".