ENH: Update testing.assert_frame_equal message to list all cols with diffs #62930

@jzwick

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish that when testing.assert_frame_equal failed in one of my unit tests, it would tell me all of the columns which have diffs.

The documentation for this function states:

Check that left and right DataFrame are equal.

This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.

However, it does not output all of the differences. It goes through the columns and stops at the first diff it encounters. For someone unfamiliar with this behavior, this can be very confusing: you think only the one reported column has a diff, you fix it, then you re-run and it looks as if, in fixing the first column, you've broken another column!?

Once you understand that this is how it behaves, you can work around it by setting a breakpoint in your code and running DataFrame.compare() on the two frames, but until you understand this, it's quite perplexing.

Example:

import pandas as pd

df1 = pd.DataFrame([['A', 1], ['B', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['a', 1], ['B', 3]], columns=['letter', 'number'])
pd.testing.assert_frame_equal(df1, df2)

raises

AssertionError: DataFrame.iloc[:, 0] (column name="letter") are different

DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]:  [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a

with no indication of the errors in the number column.
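The workaround mentioned earlier can be sketched with DataFrame.compare, which reports every differing cell rather than stopping at the first column. This is only an illustration on the two frames from the example; the variable names diff and cols_with_diffs are made up here, and compare requires the two frames to be identically labeled.

```python
import pandas as pd

df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])

# DataFrame.compare keeps only the cells that differ and returns them under
# a two-level column index: (original column, "self" / "other").
diff = df1.compare(df2)

# The outer column level names every column with at least one diff.
cols_with_diffs = diff.columns.get_level_values(0).unique().tolist()
print(cols_with_diffs)  # ['letter', 'number']
```

Unlike the assertion message, this shows both the letter and the number column at once.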

Feature Description

I propose enhancing the message to something like the following:

AssertionError: DataFrames are different

The following columns contain diffs: ["letter","number"]

First diff: DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]:  [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a

or

AssertionError: DataFrame.iloc[:, [0,1]] (column name=["letter","number"]) are different

First diff: DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]:  [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a
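As a rough sketch of how the first proposed message could be produced today, a test suite could wrap the existing functions. The wrapper name assert_frame_equal_all_cols is hypothetical and not part of pandas. It assumes the two frames have the same shape and column labels (otherwise it re-raises pandas' original error), and it assumes a pandas version in which assert_series_equal accepts check_index (1.3 or later).

```python
import pandas as pd


def assert_frame_equal_all_cols(left: pd.DataFrame, right: pd.DataFrame) -> None:
    """Hypothetical wrapper (not a pandas API): list every differing column."""
    try:
        pd.testing.assert_frame_equal(left, right)
        return
    except AssertionError as err:
        first_diff = err  # keep a reference; `err` is cleared after this block

    differing = []
    # Column-by-column checks only make sense for same-shaped frames
    # whose column labels already match.
    if left.shape == right.shape and left.columns.equals(right.columns):
        for i, col in enumerate(left.columns):
            try:
                # check_index=False compares values only, so an index
                # mismatch does not flag every column.
                pd.testing.assert_series_equal(
                    left.iloc[:, i], right.iloc[:, i], check_index=False
                )
            except AssertionError:
                differing.append(col)

    if not differing:
        raise first_diff  # nothing to add: surface pandas' original message

    raise AssertionError(
        "DataFrames are different\n\n"
        f"The following columns contain diffs: {differing}\n\n"
        f"First diff: {first_diff}"
    ) from first_diff
```

Calling assert_frame_equal_all_cols(df1, df2) on the frames from the example would then name both letter and number before showing the first diff. The wrapper ignores the keyword arguments of assert_frame_equal for brevity.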

Alternative Solutions

Alternatively, if the community strongly prefers to keep the existing behavior, I would advocate updating the docs to make this behavior explicit to the user.

Additional Context

If there are both Index and column differences, the Index differences are flagged first. For example:

df3 = df1.set_index('letter')
df4 = df2.set_index('letter')
pd.testing.assert_frame_equal(df3, df4)

raises:

AssertionError: DataFrame.index are different

DataFrame.index values are different (50.0 %)
[left]:  Index(['A', 'B'], dtype='object', name='letter')
[right]: Index(['a', 'B'], dtype='object', name='letter')
At positional index 0, first diff: A != a

in which case I would recommend something like:

AssertionError: DataFrames are different

- DataFrame.index are different
- The following columns contain diffs: ["number"]

First diff: DataFrame.index values are different (50.0 %)
[left]:  Index(['A', 'B'], dtype='object', name='letter')
[right]: Index(['a', 'B'], dtype='object', name='letter')
At positional index 0, first diff: A != a
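The bulleted summary above could be assembled along these lines. This is a sketch only: summarize_frame_diffs is a hypothetical helper, not a pandas function, and it again assumes assert_series_equal supports check_index (pandas 1.3 or later) so that the index and the column values can be checked separately.

```python
import pandas as pd


def summarize_frame_diffs(left: pd.DataFrame, right: pd.DataFrame) -> list:
    """Hypothetical helper: one summary line per kind of difference found."""
    summary = []

    # Check the index on its own, once.
    try:
        pd.testing.assert_index_equal(left.index, right.index)
    except AssertionError:
        summary.append("DataFrame.index are different")

    # Compare column values positionally, ignoring the index, but only when
    # the frames have the same shape and matching column labels.
    if left.shape == right.shape and left.columns.equals(right.columns):
        differing = []
        for i, col in enumerate(left.columns):
            try:
                pd.testing.assert_series_equal(
                    left.iloc[:, i], right.iloc[:, i], check_index=False
                )
            except AssertionError:
                differing.append(col)
        if differing:
            summary.append(f"The following columns contain diffs: {differing}")

    return summary


df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])
df3 = df1.set_index("letter")
df4 = df2.set_index("letter")

for line in summarize_frame_diffs(df3, df4):
    print("-", line)
```

For df3 and df4 this reports both the index difference and the number column, matching the two bullets in the recommended message.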

Labels: Enhancement, Needs Triage
