Skip to content

Conversation

@TomDLT
Copy link
Member

@TomDLT TomDLT commented Oct 12, 2020

Fixes #18583.

Calling radius_neighbors method of NearestNeighbors was not sorting the result by distance when sort_results=True, when fitted with algorithm="brute".
This PR fixes this issue, and clarifies the docstring.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TomDLT , some minor comments but this looks good

Comment on lines +1025 to +1027
if not return_distance:
raise ValueError("return_distance must be True "
"if sort_results is True.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this validation be done in all cases, and not just with 'brute'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is already done in the method query_radius for "ball_tree" and "kd_tree", and it is not necessary with a precomputed sparse graph.

raise ValueError("return_distance must be True "
"if sort_results is True.")
for ii in range(len(neigh_dist)):
order = np.argsort(neigh_dist[ii], kind='mergesort')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, why do we need the sort to be stable here?

Copy link
Member Author

@TomDLT TomDLT Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used mergesort to be consistent with the sorting done in _check_precomputed, but I don't remember why we chose it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not return_distance:
raise ValueError("return_distance must be True "
"if sort_results is True.")
for ii in range(len(neigh_dist)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can vectorize this operation instead of looping over all samples? IIUC the arrays in neigh_dist may be of different lengths so that might not be possible, but I'm not familiar enough to confirm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not possible, as you said each row is an array of different length.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM when green!

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well. Thanks @TomDLT.

@ogrisel ogrisel merged commit a1c17af into scikit-learn:master Oct 14, 2020
@ogrisel ogrisel deleted the sort_results branch October 14, 2020 14:18
amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

radius_neighbors method of NearestNeighbors doesn't sort by distance when sort_results=True

3 participants