-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[HUDI-8902] Fix the incorrect data read after changing the column type from float to double #13188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…float to double Fix the incorrect data read after changing the column type from float to double There are two reasons for the above problems: 1. `InternalSchemaCache` cannot obtain the path of the commit file correctly. Because after version 1.x, we placed the commit file separately in the `timeline` directory. However, `InternalSchemaCache` still retrievals files from `basePath/.hoodie`, thus failing to obtain the correct schema. 2. When we read parquet file without `enableVectorizedReader`, we use spark 's `CAST` for type conversion when we need to do scheme evolution, but spark has a loss of precision when dealing with `CAST` that converts `float` to `double`. Change logs: 1. pass the correct timeline path for `InternalSchemaCache` 2. for schema evolution without `enableVectorizedReader`, for the case `CAST(FLOAT as DOUBLE)` we turn it to `CAST(CAST(FLOAT as STRING) as DOUBLE)`. Signed-off-by: TheR1sing3un <[email protected]>
@hudi-bot run azure |
Are you saying the schema evolution is broken, why the schema evolution related tests can pass? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first issue related to InternalSchemaCache, could you please share the stacktrace? I am wondering why any of our tests that enable schema evolution did not catch it. Note that the default for hoodie.schema.on.read.enable
is false.
Cast(attr, dstType, if (needTimeZone) timeZoneId else None) | ||
|
||
// work around for the case when cast float to double | ||
if (srcType == FloatType && dstType == DoubleType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally double cast should be avoided but given that spark's type conversion loses precision and it is not as if we are going to do this for every record every time, I am ok with this change.
…e from float to double (apache#13188) Fix the incorrect data read after changing the column type from float to double There are two reasons for the above problems: 1. `InternalSchemaCache` cannot obtain the path of the commit file correctly. Because after version 1.x, we placed the commit file separately in the `timeline` directory. However, `InternalSchemaCache` still retrievals files from `basePath/.hoodie`, thus failing to obtain the correct schema. 2. When we read parquet file without `enableVectorizedReader`, we use spark 's `CAST` for type conversion when we need to do scheme evolution, but spark has a loss of precision when dealing with `CAST` that converts `float` to `double`. Change logs: 1. pass the correct timeline path for `InternalSchemaCache` 2. for schema evolution without `enableVectorizedReader`, for the case `CAST(FLOAT as DOUBLE)` we turn it to `CAST(CAST(FLOAT as STRING) as DOUBLE)`. Signed-off-by: TheR1sing3un <[email protected]> (cherry picked from commit e43841f)
Fix the incorrect data read after changing the column type from float to double
There are two reasons for the above problems:
InternalSchemaCache
cannot obtain the path of the commit file correctly. Because after version 1.x, we placed the commit file separately in thetimeline
directory. However,InternalSchemaCache
still retrievals files frombasePath/.hoodie
, thus failing to obtain the correct schema.enableVectorizedReader
, we use spark 'sCAST
for type conversion when we need to do scheme evolution, but spark has a loss of precision when dealing withCAST
that convertsfloat
todouble
.Change logs:
InternalSchemaCache
enableVectorizedReader
, for the caseCAST(FLOAT as DOUBLE)
we turn it toCAST(CAST(FLOAT as STRING) as DOUBLE)
.Change Logs
InternalSchemaCache
enableVectorizedReader
, for the caseCAST(FLOAT as DOUBLE)
we turn it toCAST(CAST(FLOAT as STRING) as DOUBLE)
.Describe context and summary for this change. Highlight if any code was copied.
Impact
fix the incorrectness after change data type from float to double
Risk level (write none, low medium or high below)
low
Documentation Update
none
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist