Commit 3bc3d9f

ENH pairwise L1 distances for sparse matrices
Simple algorithm using O(n_features) temporary space: densify row by row, then subtract and compute the L1 norm. Added BLAS support code (cblas_dasum), which speeds this up by a factor of two on x86-64 with GCC and ATLAS for a pair of 93% sparse matrices of shape 1000*3000. That's an order of magnitude faster than the dense version. Also cleaned up the chi2 kernel code while I was at it and added a nogil declaration.
1 parent 4de5e6c commit 3bc3d9f
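
The algorithm described in the commit message can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the Cython code from the commit: the function name `sparse_manhattan_sketch` is hypothetical, the inputs are plain lists holding the CSR `data`, `indices` and `indptr` arrays, and the result is returned rather than written into a preallocated `D`. The built-in `sum` stands in for the L1-norm step that the commit message says is handled by `cblas_dasum`.

```python
def sparse_manhattan_sketch(X_data, X_indices, X_indptr,
                            Y_data, Y_indices, Y_indptr,
                            n_features):
    """Pairwise L1 distances between the rows of two CSR matrices.

    Hypothetical pure-Python illustration; argument order mirrors the
    call to _sparse_manhattan in the diff, minus the output array.
    """
    n_x = len(X_indptr) - 1
    n_y = len(Y_indptr) - 1
    D = [[0.0] * n_y for _ in range(n_x)]
    row = [0.0] * n_features  # the only O(n_features) temporary
    for ix in range(n_x):
        for iy in range(n_y):
            # Reset the buffer, then scatter row ix of X into it (densify).
            for j in range(n_features):
                row[j] = 0.0
            for k in range(X_indptr[ix], X_indptr[ix + 1]):
                row[X_indices[k]] = X_data[k]
            # Subtract row iy of Y in place.
            for k in range(Y_indptr[iy], Y_indptr[iy + 1]):
                row[Y_indices[k]] -= Y_data[k]
            # L1 norm of the difference vector.
            D[ix][iy] = sum(abs(v) for v in row)
    return D


# X = [[1, 0, 2], [0, 0, 3]] and Y = [[0, 1, 0], [1, 0, 0]] in CSR form.
X_data, X_indices, X_indptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
Y_data, Y_indices, Y_indptr = [1.0, 1.0], [1, 0], [0, 1, 2]
D = sparse_manhattan_sketch(X_data, X_indices, X_indptr,
                            Y_data, Y_indices, Y_indptr, 3)
print(D)  # [[4.0, 2.0], [4.0, 4.0]]
```

The buffer `row` is reused for every pair of rows, so the extra memory stays proportional to `n_features` regardless of how many samples are compared.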

7 files changed

+7298
-4576
lines changed

sklearn/metrics/pairwise.py

Lines changed: 17 additions & 5 deletions
@@ -53,7 +53,7 @@
 from ..externals.joblib import delayed
 from ..externals.joblib.parallel import cpu_count
 
-from .pairwise_fast import _chi2_kernel_fast
+from .pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
 
 
 # Utility Functions
@@ -437,6 +437,7 @@ def manhattan_distances(X, Y=None, sum_over_features=True,
     sum_over_features : bool, default=True
         If True the function returns the pairwise distance matrix
         else it returns the componentwise L1 pairwise-distances.
+        Not supported for sparse matrix inputs.
 
     size_threshold : int, default=5e8
         Avoid creating temporary matrices bigger than size_threshold (in
@@ -450,7 +451,7 @@ def manhattan_distances(X, Y=None, sum_over_features=True,
         (n_samples_X * n_samples_Y, n_features) and D contains the
         componentwise L1 pairwise-distances (ie. absolute difference),
         else shape is (n_samples_X, n_samples_Y) and D contains
-        the pairwise l1 distances.
+        the pairwise L1 distances.
 
     Examples
     --------
@@ -472,10 +473,21 @@ def manhattan_distances(X, Y=None, sum_over_features=True,
     array([[ 1., 1.],
            [ 1., 1.]]...)
     """
-    if issparse(X) or issparse(Y):
-        raise ValueError("manhattan_distance does not support sparse"
-                         " matrices.")
     X, Y = check_pairwise_arrays(X, Y)
+
+    if issparse(X) or issparse(Y):
+        if not sum_over_features:
+            raise TypeError("sum_over_features=%r not supported"
+                            " for sparse matrices" % sum_over_features)
+
+        X = csr_matrix(X, copy=False)
+        Y = csr_matrix(Y, copy=False)
+        D = np.zeros((X.shape[0], Y.shape[0]))
+        _sparse_manhattan(X.data, X.indices, X.indptr,
+                          Y.data, Y.indices, Y.indptr,
+                          X.shape[1], D)
+        return D
+
     temporary_size = X.size * Y.shape[-1]
     # Convert to bytes
     temporary_size *= X.itemsize
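
To make the two output shapes in the docstring concrete, here is a minimal dense sketch of what `sum_over_features` controls. The helper `manhattan_dense_sketch` is hypothetical and works on plain lists of lists; it is not the library implementation and only illustrates the shapes described above, namely (n_samples_X, n_samples_Y) for the summed form and (n_samples_X * n_samples_Y, n_features) for the componentwise form.

```python
def manhattan_dense_sketch(X, Y, sum_over_features=True):
    """Illustrative dense L1 distances on lists of lists (hypothetical helper)."""
    if sum_over_features:
        # One scalar per (x, y) pair: shape (n_samples_X, n_samples_Y).
        return [[sum(abs(a - b) for a, b in zip(x, y)) for y in Y] for x in X]
    # One row of absolute differences per (x, y) pair:
    # shape (n_samples_X * n_samples_Y, n_features).
    return [[abs(a - b) for a, b in zip(x, y)] for x in X for y in Y]


X = [[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
Y = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

summed = manhattan_dense_sketch(X, Y)
componentwise = manhattan_dense_sketch(X, Y, sum_over_features=False)

print(summed)              # [[4.0, 2.0], [4.0, 4.0]]
print(len(componentwise))  # 4 rows, one per (x, y) pair
```

The componentwise result has one full-width row per pair of samples, whereas the sparse branch in the diff only fills a preallocated (n_samples_X, n_samples_Y) array, which matches its rejection of `sum_over_features=False`.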
