-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
WIP: Fix precision of DataFrame.combine and DataFrame.combine_first #62691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Fix precision of DataFrame.combine and DataFrame.combine_first #62691
Conversation
3ec10fc to
451621b
Compare
| if y.name not in self.columns: | ||
| return y_values | ||
|
|
||
| return expressions.where(mask, y_values, x_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not using expressions.where have a perf impact?
| (-1666880195890293744, "int64"), | ||
| ), | ||
| ) | ||
| def test_combine_first_preserve_precision(self, wide_val, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you made changes to both combine and combine_first, but only a test for combine_first. can you test the fixed bug in combine?
| """Resolve the combined dtypes according to the original dtypes.""" | ||
| cast_map: dict[IndexLabel, DtypeObj] = {} | ||
| for col in combined_df.columns: | ||
| ser = combined_df[col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not going to be robust to non-unique columns. can you use .iloc[:, i] instead?
| orig_dt_self = self_orig.dtypes.get(col) | ||
| orig_dt_other = other_orig.dtypes.get(col) | ||
|
|
||
| was_promoted = (orig_dt_self in [np.int64, np.uint64]) or ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no risk for smaller int widths?
| 2 NaN 3.0 1.0 | ||
| """ | ||
|
|
||
| # GH#60128 Integers n where |n| > 2**53 would lose precision after align |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align only upcasts to float when the frames have different indexes right? would it be simpler to just warn users and tell them to do alignment before calling .combine?
|
My bad! I meant to close this draft PR. |
|
This is not the right solution to the problem. A maintainer told me that casting to EA dtypes and back is an controversial approach and discouraged in this codebase. I have already merged this PR which partially addresses bug report. The combiner did not previously preserve EA dtypes as it should, leaving users no way to avoid float64 lossy conversion even with EA dtypes. I'm currently thinking of a stopgap that can close the bug report without using EA dtypes. |
EDIT: This was my first attempt. Converting to EA Dtypes and back just to prevent lossy conversion is a discouraged approach not taken in this codebase, according to maintainers.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.