Commit b2799ca

DOC: Added memory workaround to dbscan doc
1 parent 281ac0c commit b2799ca

1 file changed: +14 -0 lines changed

doc/modules/clustering.rst

Lines changed: 14 additions & 0 deletions
@@ -767,6 +767,20 @@ by black points below.
 The possibility to use custom metrics is retained;
 for details, see :class:`NearestNeighbors`.

+This implementation is by default not memory efficient because it constructs
+a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot
+be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
+A couple of mechanisms for getting around this are:
+
+- A sparse radius neighborhood graph (where missing
+  entries are presumed to be out of eps) can be precomputed in a memory-efficient
+  way and dbscan can be run over this with ``metric='precomputed'``.
+
+- The dataset can be compressed, either by removing exact duplicates if
+  these occur in your data, or by using BIRCH. Then you only have a
+  relatively small number of representatives for a large number of points.
+  You can then provide a ``sample_weight`` when fitting DBSCAN.
+
 .. topic:: References:

 * "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
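
Illustration (not part of the commit): the first workaround in the added text, a precomputed sparse radius neighborhood graph passed to DBSCAN with ``metric='precomputed'``, might look roughly like the sketch below. The toy data, the ``eps`` value and the variable names are assumptions made for the example, and ``sort_results=True`` assumes a scikit-learn version that supports that argument.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Toy data (assumption for illustration): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
eps = 0.5

# Build a sparse graph that stores only distances <= eps.
# Entries that are absent are treated as being farther away than eps.
nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance", sort_results=True)

# Cluster on the sparse graph instead of the raw features.
labels_sparse = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(graph).labels_

# Reference run on the raw features, for comparison only.
labels_dense = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
```

On this toy input the two label arrays should agree. The memory saving comes from the graph holding only the within-``eps`` pairs rather than all n^2 distances.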

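
Illustration (not part of the commit): the second workaround in the added text, compressing the dataset and passing ``sample_weight``, could be sketched as follows for the exact-duplicates case only. The BIRCH variant is not shown. The toy data, parameter values and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumption for illustration): 40 distinct points, each repeated 10 times.
rng = np.random.RandomState(0)
base = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])
X = np.repeat(base, 10, axis=0)

# Collapse exact duplicate rows and remember how often each one occurred.
unique_rows, inverse, counts = np.unique(
    X, axis=0, return_inverse=True, return_counts=True
)
inverse = inverse.ravel()  # guard against a 2-D inverse in some NumPy releases

# Fit on the representatives only; each one counts as `counts[i]` samples.
model = DBSCAN(eps=0.5, min_samples=5).fit(unique_rows, sample_weight=counts)

# Map the labels of the representatives back onto the original rows.
labels_full = model.labels_[inverse]

# Reference run on the uncompressed data, for comparison only.
labels_ref = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_
```

Here DBSCAN only has to consider 40 rows instead of 400, while the weights keep the density estimate consistent with the full dataset.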