@@ -8,10 +8,10 @@ Clustering
88unlabeled data can be performed with the module :mod:`sklearn.cluster`.
99
1010Each clustering algorithm comes in two variants: a class that implements
11- the `fit` method to learn the clusters on train data, and a function
11+ the ``fit`` method to learn the clusters on train data, and a function
1212that, given train data, returns an array of integer labels corresponding
1313to the different clusters. For the class, the labels over the training
14- data can be found in the `labels_` attribute.
14+ data can be found in the ``labels_`` attribute.
1515
1616.. currentmodule:: sklearn.cluster
1717
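As a minimal sketch of the two variants (using :class:`KMeans` purely as an
example; the toy data below is made up for illustration)::

    import numpy as np
    from sklearn.cluster import KMeans, k_means

    X = np.array([[1, 0], [1, 2], [10, 0], [10, 2]])

    # Class variant: fit on the train data, then read the labels_ attribute.
    model = KMeans(n_clusters=2, random_state=0).fit(X)
    print(model.labels_)              # one integer label per training sample

    # Function variant: returns the centroids, the labels and the inertia.
    centroids, labels, inertia = k_means(X, n_clusters=2, random_state=0)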
@@ -53,7 +53,7 @@ Overview of clustering methods
5353
5454 * - :ref:`K-Means <k_means>`
5555 - number of clusters
56- - Very large `n_samples`, medium `n_clusters` with
56+ - Very large ``n_samples``, medium ``n_clusters`` with
5757 :ref:`MiniBatch code <mini_batch_kmeans>`
5858 - General-purpose, even cluster size, flat geometry, not too many clusters
5959 - Distances between points
@@ -66,32 +66,32 @@ Overview of clustering methods
6666
6767 * - :ref:`Mean-shift <mean_shift>`
6868 - bandwidth
69- - Not scalable with n_samples
69+ - Not scalable with ``n_samples``
7070 - Many clusters, uneven cluster size, non-flat geometry
7171 - Distances between points
7272
7373 * - :ref:`Spectral clustering <spectral_clustering>`
7474 - number of clusters
75- - Medium `n_samples`, small `n_clusters`
75+ - Medium ``n_samples``, small ``n_clusters``
7676 - Few clusters, even cluster size, non-flat geometry
7777 - Graph distance (e.g. nearest-neighbor graph)
7878
7979 * - :ref:`Ward hierarchical clustering <hierarchical_clustering>`
8080 - number of clusters
81- - Large `n_samples` and `n_clusters`
81+ - Large ``n_samples`` and ``n_clusters``
8282 - Many clusters, possibly connectivity constraints
8383 - Distances between points
8484
8585 * - :ref:`Agglomerative clustering <hierarchical_clustering>`
8686 - number of clusters, linkage type, distance
87- - Large `n_samples` and `n_clusters`
87+ - Large ``n_samples`` and ``n_clusters``
8888 - Many clusters, possibly connectivity constraints, non-Euclidean
8989 distances
9090 - Any pairwise distance
9191
9292 * - :ref:`DBSCAN <dbscan>`
9393 - neighborhood size
94- - Very large `n_samples`, medium `n_clusters`
94+ - Very large ``n_samples``, medium ``n_clusters``
9595 - Non-flat geometry, uneven cluster sizes
9696 - Distances between nearest points
9797
@@ -118,12 +118,12 @@ K-means
118118
119119The :class:`KMeans` algorithm clusters data by trying to separate samples
120120in n groups of equal variance, minimizing a criterion known as the
121- `inertia<inertia>` or within-cluster sum-of-squares.
121+ `inertia <inertia>` or within-cluster sum-of-squares.
122122This algorithm requires the number of clusters to be specified.
123123It scales well to a large number of samples and has been used
124124across a large range of application areas in many different fields.
125125
126- The k-means algorithm divides a set of :math:`N` samples :math:`X`:
126+ The k-means algorithm divides a set of :math:`N` samples :math:`X`
127127into :math:`K` disjoint clusters :math:`C`,
128128each described by the mean :math:`\mu_j` of the samples in the cluster.
129129The means are commonly called the cluster "centroids";
@@ -146,7 +146,7 @@ It suffers from various drawbacks:
146146 better and zero is optimal. But in very high-dimensional spaces, Euclidean
147147 distances tend to become inflated
148148 (this is an instance of the so-called "curse of dimensionality").
149- Running a dimensionality reduction algorithm such as `PCA<PCA>`
149+ Running a dimensionality reduction algorithm such as `PCA <PCA>`
150150 prior to k-means clustering can alleviate this problem
151151 and speed up the computations.
152152
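A rough sketch of that suggestion (the digits dataset and the choice of 10
components are only illustrative)::

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)

    # Project the 64-dimensional digits onto a few principal components before
    # clustering; this mitigates inflated Euclidean distances and is cheaper.
    X_reduced = PCA(n_components=10).fit_transform(X)
    labels = KMeans(n_clusters=10, random_state=0).fit_predict(X_reduced)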
@@ -189,7 +189,7 @@ k-means++ initialization scheme, which has been implemented in scikit-learn
189189random initialization, as shown in the reference.
190190
191191A parameter can be given to allow K-means to be run in parallel, called
192- `n_jobs`. Giving this parameter a positive value uses that many processors
192+ ``n_jobs``. Giving this parameter a positive value uses that many processors
193193(default: 1). A value of -1 uses all available processors, with -2 using one
194194less, and so on. Parallelization generally speeds up computation at the cost of
195195memory (in this case, multiple copies of centroids need to be stored, one for
@@ -232,7 +232,7 @@ k-means, mini-batch k-means produces results that are generally only slightly
232232worse than the standard algorithm.
233233
234234The algorithm iterates between two major steps, similar to vanilla k-means.
235- In the first step, `b` samples are drawn randomly from the dataset, to form
235+ In the first step, :math:`b` samples are drawn randomly from the dataset, to form
236236a mini-batch. These are then assigned to the nearest centroid. In the second
237237step, the centroids are updated. In contrast to k-means, this is done on a
238238per-sample basis. For each sample in the mini-batch, the assigned centroid
@@ -291,12 +291,12 @@ is given.
291291
292292Affinity Propagation can be interesting as it chooses the number of
293293clusters based on the data provided. For this purpose, the two important
294- parameters are the `preference`, which controls how many exemplars are
295- used, and the `damping` factor.
294+ parameters are the *preference*, which controls how many exemplars are
295+ used, and the *damping factor*.
296296
297297The main drawback of Affinity Propagation is its complexity. The
298- algorithm has a time complexity of the order :math:`O(N^2 T)`, where `N`
299- is the number of samples and `T` is the number of iterations until
298+ algorithm has a time complexity of the order :math:`O(N^2 T)`, where :math:`N`
299+ is the number of samples and :math:`T` is the number of iterations until
300300convergence. Further, the memory complexity is of the order
301301:math:`O(N^2)` if a dense similarity matrix is used, but reducible if a
302302sparse similarity matrix is used. This makes Affinity Propagation most
@@ -312,30 +312,34 @@ appropriate for small to medium sized datasets.
312312
313313**Algorithm description:**
314314The messages sent between points belong to one of two categories. The first is
315- the responsibility `r(i, k)`, which is the accumulated evidence that sample `k`
316- should be the exemplar for sample `i`. The second is the availability `a(i, k)`
317- which is the accumulated evidence that sample `i` should choose sample `k` to
318- be its exemplar, and considers the values for all other samples that `k` should
315+ the responsibility :math:`r(i, k)`,
316+ which is the accumulated evidence that sample :math:`k`
317+ should be the exemplar for sample :math:`i`.
318+ The second is the availability :math:`a(i, k)`,
319+ which is the accumulated evidence that sample :math:`i`
320+ should choose sample :math:`k` to be its exemplar,
321+ and considers the values for all other samples that :math:`k` should
319322be an exemplar. In this way, exemplars are chosen by samples if they are (1)
320323similar enough to many samples and (2) chosen by many samples to be
321324representative of themselves.
322325
323- More formally, the responsibility of a sample `k` to be the exemplar of sample
324- `i` is given by:
326+ More formally, the responsibility of a sample :math:`k`
327+ to be the exemplar of sample :math:`i` is given by:
325328
326329.. math::
327330
328331 r(i, k) \leftarrow s(i, k) - \max [ a(i, \acute{k}) + s(i, \acute{k}) \forall \acute{k} \neq k ]
329332
330- Where :math:`s(i, k)` is the similarity between samples `i` and `k`. The
331- availability of sample `k` to be the exemplar of sample `i` is given by:
333+ Where :math:`s(i, k)` is the similarity between samples :math:`i` and :math:`k`.
334+ The availability of sample :math:`k`
335+ to be the exemplar of sample :math:`i` is given by:
332336
333337.. math::
334338
335339 a(i, k) \leftarrow \min [0, r(k, k) + \sum_{\acute{i}~s.t.~\acute{i} \notin \{i, k\}}{r(\acute{i}, k)}]
336340
337- To begin with, all values for `r` and `a` are set to zero, and the calculation
338- of each iterates until convergence.
341+ To begin with, all values for :math:`r` and :math:`a` are set to zero,
342+ and the calculation of each iterates until convergence.
339343
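A small usage sketch; the ``preference`` value and the toy data are arbitrary
illustrations, not recommendations::

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

    # More negative preference values lead to fewer exemplars (clusters).
    af = AffinityPropagation(preference=-50, damping=0.5).fit(X)
    print(af.cluster_centers_indices_)   # indices of the samples chosen as exemplars
    print(af.labels_)                    # cluster assignment for every sample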
340344.. _mean_shift:
341345
@@ -367,9 +371,9 @@ the mean of the samples within its neighborhood:
367371 m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i) x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)}
368372
369373 The algorithm automatically sets the number of clusters, relying instead on a
370- parameter `bandwidth`, which dictates the size of the region to search through.
374+ parameter ``bandwidth``, which dictates the size of the region to search through.
371375This parameter can be set manually, but can be estimated using the provided
372- `estimate_bandwidth` function, which is called if the bandwidth is not set.
376+ ``estimate_bandwidth`` function, which is called if the bandwidth is not set.
373377
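For instance, a sketch that calls ``estimate_bandwidth`` explicitly (the
``quantile`` value and the random data are placeholders)::

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    X = np.random.RandomState(0).randn(300, 2)

    # Estimate the bandwidth from the data, then run mean shift with it;
    # leaving bandwidth unset would trigger the same estimation internally.
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    ms = MeanShift(bandwidth=bandwidth).fit(X)
    print(len(np.unique(ms.labels_)))    # number of clusters found automatically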
374378The algorithm is not highly scalable, as it requires multiple nearest neighbor
375379searches during the execution of the algorithm. The algorithm is guaranteed to
@@ -463,16 +467,16 @@ Different label assignment strategies
463467---------------------------------------
464468
465469Different label assignment strategies can be used, corresponding to the
466- `assign_labels` parameter of :class:`SpectralClustering`.
467- The `kmeans` strategy can match finer details of the data, but it can be
468- more unstable. In particular, unless you control the `random_state`, it
470+ ``assign_labels`` parameter of :class:`SpectralClustering`.
471+ The ``"kmeans"`` strategy can match finer details of the data, but it can be
472+ more unstable. In particular, unless you control the ``random_state``, it
469473may not be reproducible from run-to-run, as it depends on a random
470- initialization. On the other hand, the `discretize` strategy is 100%
474+ initialization. On the other hand, the ``"discretize"`` strategy is 100%
471475reproducible, but it tends to create parcels of fairly even and
472476geometrical shape.
473477
474478===================================== =====================================
475- `assign_labels="kmeans"`              `assign_labels="discretize"`
479+ ``assign_labels="kmeans"``            ``assign_labels="discretize"``
476480===================================== =====================================
477481|lena_kmeans|                         |lena_discretize|
478482===================================== =====================================
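A short sketch contrasting the two strategies (data and parameters are
placeholders)::

    import numpy as np
    from sklearn.cluster import SpectralClustering

    X = np.random.RandomState(0).rand(100, 2)

    # "kmeans" assignment depends on the random initialization, so fix
    # random_state to make runs reproducible.
    labels_km = SpectralClustering(n_clusters=4, assign_labels="kmeans",
                                   random_state=0).fit_predict(X)

    # "discretize" is reproducible but tends to produce more regular parcels.
    labels_disc = SpectralClustering(n_clusters=4,
                                     assign_labels="discretize").fit_predict(X)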
@@ -697,13 +701,14 @@ cluster is therefore a set of core samples, each close to each other
697701(measured by some distance measure)
698702and a set of non-core samples that are close to a core sample (but are not
699703themselves core samples). There are two parameters to the algorithm,
700- `min_samples` and `eps`, which define formally what we mean when we say *dense*.
701- A higher `min_samples` or lower `eps` indicate higher density necessary to form
702- a cluster.
704+ ``min_samples`` and ``eps``,
705+ which define formally what we mean when we say *dense*.
706+ Higher ``min_samples`` or lower ``eps``
707+ indicate higher density necessary to form a cluster.
703708
704709More formally, we define a core sample as being a sample in the dataset such
705- that there exist `min_samples` other samples within a distance of
706- `eps`, which are defined as *neighbors* of the core sample. This tells
710+ that there exist ``min_samples`` other samples within a distance of
711+ ``eps``, which are defined as *neighbors* of the core sample. This tells
707712us that the core sample is in a dense area of the vector space. A cluster
708713is a set of core samples that can be built by recursively taking a core
709714sample, finding all of its neighbors that are core samples, finding all of
@@ -713,9 +718,9 @@ in the cluster but are not themselves core samples. Intuitively, these samples
713718are on the fringes of a cluster.
714719
715720Any core sample is part of a cluster, by definition. Further, any cluster has
716- at least `min_samples` points in it, following the definition of a core
721+ at least ``min_samples`` points in it, following the definition of a core
717722sample. For any sample that is not a core sample, and does have a
718- distance higher than `eps` to any core sample, it is considered an outlier by
723+ distance higher than ``eps`` to any core sample, it is considered an outlier by
719724the algorithm.
720725
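A minimal sketch of these two parameters in action (the values are chosen only
for the toy data below)::

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1, 2], [2, 2], [2, 3],
                  [8, 7], [8, 8], [25, 80]])

    # min_samples and eps together define "dense": each core sample must have
    # enough neighbors within a distance of eps.
    db = DBSCAN(eps=3, min_samples=2).fit(X)
    print(db.labels_)                 # outliers are labelled -1
    print(db.core_sample_indices_)    # indices of the core samples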
721726In the figure below, the color indicates cluster membership, with large circles
@@ -739,9 +744,9 @@ by black points below.
739744 always belong to the same clusters (although the labels may be
740745 different). The non-determinism comes from deciding to which cluster a
741746 non-core sample belongs. A non-core sample can have a distance lower
742- than `eps` to two core samples in different clusters. By the
747+ than ``eps`` to two core samples in different clusters. By the
743748 triangular inequality, those two core samples must be more distant than
744- `eps` from each other, or they would be in the same cluster. The non-core
749+ ``eps`` from each other, or they would be in the same cluster. The non-core
745750 sample is assigned to whichever cluster is generated first, where
746751 the order is determined randomly. Other than the ordering of
747752 the dataset, the algorithm is deterministic, making the results relatively
@@ -798,7 +803,7 @@ chance normalization**::
798803 >>> metrics.adjusted_rand_score(labels_true, labels_pred) # doctest: +ELLIPSIS
799804 0.24...
800805
801- One can permute 0 and 1 in the predicted labels and rename `2` by `3` and get
806+ One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get
802807the same score::
803808
804809 >>> labels_pred = [1, 1, 0, 0, 3, 3]
@@ -921,7 +926,7 @@ proposed more recently and is **normalized against chance**::
921926 >>> metrics.adjusted_mutual_info_score(labels_true, labels_pred) # doctest: +ELLIPSIS
922927 0.22504...
923928
924- One can permute 0 and 1 in the predicted labels and rename `2` by `3` and get
929+ One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get
925930the same score::
926931
927932 >>> labels_pred = [1, 1, 0, 0, 3, 3]