[HUDI-8902] Fix the incorrect data read after changing the column type from float to double #13188

Merged: 1 commit into apache:master on Apr 21, 2025

Conversation

TheR1sing3un (Member)

Fix the incorrect data read after changing the column type from float to double

There are two reasons for this problem:

  1. `InternalSchemaCache` cannot locate the commit file correctly. Since the 1.x release, commit files are stored separately in the `timeline` directory, but `InternalSchemaCache` still looks for them under `basePath/.hoodie`, so it fails to obtain the correct schema.
  2. When we read Parquet files without `enableVectorizedReader`, we rely on Spark's `CAST` for type conversion during schema evolution, but a direct `CAST` from `float` to `double` widens the binary value, so the result can differ from the expected decimal value (e.g. `0.1f` becomes `0.10000000149011612`); see the sketch below.
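
A minimal sketch (not part of this PR) that reproduces the precision issue and shows why casting through a string helps:

```scala
import org.apache.spark.sql.SparkSession

// Widening a float to double keeps the exact binary value, which no longer
// matches the decimal literal; casting through a string preserves the
// decimal representation instead.
object FloatCastDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("float-cast").getOrCreate()
  import spark.implicits._

  val df = Seq(0.1f).toDF("f")

  // Direct cast: 0.1f becomes 0.10000000149011612
  df.selectExpr("CAST(f AS DOUBLE)").show(false)

  // Cast through string, as this PR does during schema evolution:
  // 0.1f -> "0.1" -> 0.1
  df.selectExpr("CAST(CAST(f AS STRING) AS DOUBLE)").show(false)

  spark.stop()
}
```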

Change Logs

  1. Pass the correct timeline path to `InternalSchemaCache`.
  2. For schema evolution without `enableVectorizedReader`, rewrite `CAST(FLOAT AS DOUBLE)` as `CAST(CAST(FLOAT AS STRING) AS DOUBLE)`.


Impact

Fixes incorrect query results after changing a column's type from float to double.

Risk level: low

Documentation Update

none


Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Commit: Fix the incorrect data read after changing the column type from float to double

Signed-off-by: TheR1sing3un <[email protected]>
github-actions bot added the size:S (PR with lines of changes in (10, 100]) label Apr 21, 2025
TheR1sing3un (Member, Author)

@hudi-bot run azure

hudi-bot

CI report:

Bot commands: @hudi-bot supports the following:
  • @hudi-bot run azure: re-run the last Azure build

danny0405 (Contributor)

`InternalSchemaCache` cannot locate the commit file correctly. Since the 1.x release, commit files are stored separately in the `timeline` directory, but `InternalSchemaCache` still looks for them under `basePath/.hoodie`, so it fails to obtain the correct schema.

Are you saying schema evolution is broken? Then why do the schema-evolution-related tests pass?

codope (Member) left a comment:

For the first issue related to `InternalSchemaCache`, could you please share the stacktrace? I am wondering why none of our tests that enable schema evolution caught it. Note that the default for `hoodie.schema.on.read.enable` is false.
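
For context, schema-on-read evolution is opt-in. A hypothetical snippet enabling it on a Spark read; only the option key comes from the comment above, the rest is illustrative:

```scala
// Illustrative only: hoodie.schema.on.read.enable defaults to false,
// so a query has to opt in before schema-on-read evolution applies.
val df = spark.read.format("hudi")
  .option("hoodie.schema.on.read.enable", "true")
  .load(basePath) // basePath: the Hudi table path (assumed to be defined)
```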

The inline review comment below is attached to these lines of the diff:

```scala
Cast(attr, dstType, if (needTimeZone) timeZoneId else None)

// work around for the case when cast float to double
if (srcType == FloatType && dstType == DoubleType) {
```
codope (Member): Ideally the double cast should be avoided, but given that Spark's type conversion loses precision, and it is not as if we are going to do this for every record every time, I am OK with this change.
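
A hedged sketch of what the rewrite amounts to; the names `attr`, `srcType`, `dstType`, and `timeZoneId` follow the quoted snippet, but the surrounding Hudi code may differ:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{DataType, DoubleType, FloatType, StringType}

// Build the cast used for schema evolution. For float -> double, cast
// through a string so the decimal representation is preserved, i.e.
// CAST(CAST(attr AS STRING) AS DOUBLE) instead of CAST(attr AS DOUBLE).
def castForEvolution(attr: Expression, srcType: DataType, dstType: DataType,
                     timeZoneId: Option[String]): Expression = {
  if (srcType == FloatType && dstType == DoubleType) {
    Cast(Cast(attr, StringType, timeZoneId), DoubleType, timeZoneId)
  } else {
    Cast(attr, dstType, timeZoneId)
  }
}
```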

TheR1sing3un (Member, Author)

TheR1sing3un commented Apr 21, 2025

For the first issue related to InternalSchemaCache, could you please share the stacktrace?

I ran the UT `TestSpark3DDL::Test alter table properties and add rename drop column with table services` and set a breakpoint before the query that reads the data whose type changed from float to double.

Then, with a breakpoint at `InternalSchema::getInternalSchemaByVersionId`, I found a `FileNotFoundException` caused by the wrong timeline path.

[screenshots: the breakpoints and the FileNotFoundException stacktrace]

github-project-automation bot moved this from 🆕 New to 🛬 Near landing in Hudi PR Support Apr 21, 2025
TheR1sing3un (Member, Author)

Are you saying schema evolution is broken? Then why do the schema-evolution-related tests pass?

Because when the schema cannot be obtained from the commit file, it is obtained from the schema history instead. All of our tests go through this fallback path, so they pass despite the bug.

[screenshot: the fallback code path]
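
A hypothetical sketch of that lookup order (all names are illustrative, not Hudi's actual API):

```scala
// Illustrative only: the cache first tries the commit metadata for the
// requested schema version, then falls back to the persisted schema history.
case class InternalSchema(versionId: Long) // stand-in for Hudi's InternalSchema

def readSchemaFromCommitFile(versionId: Long): Option[InternalSchema] =
  None // in the bug this failed: the commit file was looked up under basePath/.hoodie

def readSchemaFromHistoryFile(versionId: Long): Option[InternalSchema] =
  Some(InternalSchema(versionId)) // fallback that kept the existing tests green

def getInternalSchemaByVersionId(versionId: Long): Option[InternalSchema] =
  readSchemaFromCommitFile(versionId).orElse(readSchemaFromHistoryFile(versionId))
```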


codope merged commit e43841f into apache:master Apr 21, 2025
59 of 60 checks passed
github-project-automation bot moved this from 🛬 Near landing to ✅ Done in Hudi PR Support Apr 21, 2025
voonhous pushed a commit to voonhous/hudi that referenced this pull request Apr 21, 2025
[HUDI-8902] Fix the incorrect data read after changing the column type from float to double (apache#13188)

Signed-off-by: TheR1sing3un <[email protected]>
(cherry picked from commit e43841f)
Labels
priority:blocker release-1.0.2 size:S PR with lines of changes in (10, 100]
Projects
Status: ✅ Done
4 participants