[HUDI-8902] Fix the incorrect data read after changing the column type from float to double #13188

Merged: 1 commit into apache:master on Apr 21, 2025

Conversation

TheR1sing3un (Member)

Fix the incorrect data read after changing the column type from float to double

There are two reasons for this problem:

  1. `InternalSchemaCache` cannot locate the commit file correctly. Since the 1.x release, commit files are stored separately in the `timeline` directory, but `InternalSchemaCache` still looks for them under `basePath/.hoodie`, so it fails to obtain the correct schema.
  2. When we read Parquet files without `enableVectorizedReader`, we rely on Spark's `CAST` for type conversion during schema evolution, but a direct `CAST` from `float` to `double` widens the binary value, so the result can differ from the expected decimal value (e.g. `0.1f` becomes `0.10000000149011612`); see the sketch below.
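
A minimal sketch (not part of this PR) that reproduces the precision issue and shows why casting through a string helps:

```scala
import org.apache.spark.sql.SparkSession

// Widening a float to double keeps the exact binary value, which no longer
// matches the decimal literal; casting through a string preserves the
// decimal representation instead.
object FloatCastDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("float-cast").getOrCreate()
  import spark.implicits._

  val df = Seq(0.1f).toDF("f")

  // Direct cast: 0.1f becomes 0.10000000149011612
  df.selectExpr("CAST(f AS DOUBLE)").show(false)

  // Cast through string, as this PR does during schema evolution:
  // 0.1f -> "0.1" -> 0.1
  df.selectExpr("CAST(CAST(f AS STRING) AS DOUBLE)").show(false)

  spark.stop()
}
```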

Change Logs

  1. Pass the correct timeline path to `InternalSchemaCache`.
  2. For schema evolution without `enableVectorizedReader`, rewrite `CAST(FLOAT AS DOUBLE)` as `CAST(CAST(FLOAT AS STRING) AS DOUBLE)`.


Impact

Fixes incorrect query results after changing a column's type from float to double.

Risk level: low

Documentation Update

none


Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Commit: Fix the incorrect data read after changing the column type from float to double

Signed-off-by: TheR1sing3un <[email protected]>
github-actions bot added the size:S (PR with lines of changes in (10, 100]) label Apr 21, 2025
TheR1sing3un (Member, Author)

@hudi-bot run azure

hudi-bot

CI report:

Bot commands: @hudi-bot supports the following:
  • @hudi-bot run azure: re-run the last Azure build

danny0405 (Contributor)

`InternalSchemaCache` cannot locate the commit file correctly. Since the 1.x release, commit files are stored separately in the `timeline` directory, but `InternalSchemaCache` still looks for them under `basePath/.hoodie`, so it fails to obtain the correct schema.

Are you saying schema evolution is broken? Then why do the schema-evolution-related tests pass?

codope (Member) left a comment:

For the first issue related to `InternalSchemaCache`, could you please share the stacktrace? I am wondering why none of our tests that enable schema evolution caught it. Note that the default for `hoodie.schema.on.read.enable` is false.
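
For context, schema-on-read evolution is opt-in. A hypothetical snippet enabling it on a Spark read; only the option key comes from the comment above, the rest is illustrative:

```scala
// Illustrative only: hoodie.schema.on.read.enable defaults to false,
// so a query has to opt in before schema-on-read evolution applies.
val df = spark.read.format("hudi")
  .option("hoodie.schema.on.read.enable", "true")
  .load(basePath) // basePath: the Hudi table path (assumed to be defined)
```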

The inline review comment below is attached to these lines of the diff:

```scala
Cast(attr, dstType, if (needTimeZone) timeZoneId else None)

// work around for the case when cast float to double
if (srcType == FloatType && dstType == DoubleType) {
```
codope (Member): Ideally the double cast should be avoided, but given that Spark's type conversion loses precision, and it is not as if we are going to do this for every record every time, I am OK with this change.
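
A hedged sketch of what the rewrite amounts to; the names `attr`, `srcType`, `dstType`, and `timeZoneId` follow the quoted snippet, but the surrounding Hudi code may differ:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.{DataType, DoubleType, FloatType, StringType}

// Build the cast used for schema evolution. For float -> double, cast
// through a string so the decimal representation is preserved, i.e.
// CAST(CAST(attr AS STRING) AS DOUBLE) instead of CAST(attr AS DOUBLE).
def castForEvolution(attr: Expression, srcType: DataType, dstType: DataType,
                     timeZoneId: Option[String]): Expression = {
  if (srcType == FloatType && dstType == DoubleType) {
    Cast(Cast(attr, StringType, timeZoneId), DoubleType, timeZoneId)
  } else {
    Cast(attr, dstType, timeZoneId)
  }
}
```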

TheR1sing3un (Member, Author)

TheR1sing3un commented Apr 21, 2025

For the first issue related to InternalSchemaCache, could you please share the stacktrace?

I ran the UT `TestSpark3DDL::Test alter table properties and add rename drop column with table services` and set a breakpoint before the query that reads the data whose type changed from float to double.

Then, with a breakpoint at `InternalSchema::getInternalSchemaByVersionId`, I found a `FileNotFoundException` caused by the wrong timeline path.

[screenshots: the breakpoints and the FileNotFoundException stacktrace]

github-project-automation bot moved this from 🆕 New to 🛬 Near landing in Hudi PR Support Apr 21, 2025
TheR1sing3un (Member, Author)

Are you saying schema evolution is broken? Then why do the schema-evolution-related tests pass?

Because when the schema cannot be obtained from the commit file, it is obtained from the schema history instead. All of our tests go through this fallback path, so they pass despite the bug.

[screenshot: the fallback code path]
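
A hypothetical sketch of that lookup order (all names are illustrative, not Hudi's actual API):

```scala
// Illustrative only: the cache first tries the commit metadata for the
// requested schema version, then falls back to the persisted schema history.
case class InternalSchema(versionId: Long) // stand-in for Hudi's InternalSchema

def readSchemaFromCommitFile(versionId: Long): Option[InternalSchema] =
  None // in the bug this failed: the commit file was looked up under basePath/.hoodie

def readSchemaFromHistoryFile(versionId: Long): Option[InternalSchema] =
  Some(InternalSchema(versionId)) // fallback that kept the existing tests green

def getInternalSchemaByVersionId(versionId: Long): Option[InternalSchema] =
  readSchemaFromCommitFile(versionId).orElse(readSchemaFromHistoryFile(versionId))
```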


codope merged commit e43841f into apache:master Apr 21, 2025
59 of 60 checks passed
github-project-automation bot moved this from 🛬 Near landing to ✅ Done in Hudi PR Support Apr 21, 2025
voonhous pushed a commit to voonhous/hudi that referenced this pull request Apr 21, 2025
[HUDI-8902] Fix the incorrect data read after changing the column type from float to double (apache#13188)

Signed-off-by: TheR1sing3un <[email protected]>
(cherry picked from commit e43841f)
Labels
priority:blocker release-1.0.2 size:S PR with lines of changes in (10, 100]
Projects
Status: ✅ Done
4 participants