Pipeline.DPA module

class Pipeline.DPA.DensityPeakAdvanced(Z=1, metric='euclidean', densities=None, err_densities=None, k_hat=None, nn_distances=None, nn_indices=None, affinity='precomputed', density_algo='PAk', k_max=1000, D_thr=23.92812698, dim=None, dim_algo='twoNN', blockAn=True, block_ratio=5, frac=1, n_jobs=None)[source]

Bases: sklearn.base.ClusterMixin, sklearn.base.BaseEstimator

Class definition for the non-parametric Density Peak clustering.

The default pipeline makes use of the PAk density estimator and of the TWO-NN intristic dimension estimator. The densities and the corresponding errors can also be provided as precomputed arrays.

Parameters
  • Z (float, default = 1) – The number of standard deviations, which fixes the level of statistical confidence at which one decides to consider a cluster meaningful.

  • metric (string, or callable) – The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean,cosine]. If metric is precomputed, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. Default is euclidean.

  • densities (array [n_samples], default = None) – The logarithm of the density at each point. If provided, the following parameters are ignored: density_algo, k_max, D_thr.

  • err_densities (array [n_samples], default = None) – The uncertainty in the density estimation, obtained by computing the inverse of the Fisher information matrix.

  • k_hat (array [n_samples], default = None) – The optimal number of neighbors for which the condition of constant density holds.

  • nn_distances (array [n_samples, k_max+1]) – Distances to the k_max neighbors of each points.

  • nn_indices (array [n_samples, k_max+1]) – Indices of the k_max neighbors of each points.

  • affinity (string or callable, default 'precomputed') –

    How to construct the affinity matrix.
    • nearest_neighbors : construct the affinity matrix by computing a graph of nearest neighbors.

    • rbf : construct the affinity matrix using a radial basis function (RBF) kernel.

    • precomputed : interpret X as a precomputed affinity matrix.

    • precomputed_nearest_neighbors : interpret X as a sparse graph of precomputed nearest neighbors, and constructs the affinity matrix by selecting the n_neighbors nearest neighbors.

    • one of the kernels supported by pairwise_kernels().

  • density_algo (string, default = "PAk") – Define the algorithm to use as density estimator. It mast be one of the options allowed by VALID_DENSITY = [PAk, kNN].

  • k_max (int, default=1000) – This parameter is considered if density_algo is PAk or kNN, it is ignored otherwise. k_max set the maximum number of nearest-neighbors considered by the density estimator. If density_algo=PAk, k_max is used by the algorithm in the search for the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If density_algo=kNN, k_max set the number of neighbors to be used by the standard k-Nearest Neighbor algorithm. If the number of points in the sample N is less than the default value, k_max will be set automatically to the value N/2.

  • D_thr (float, default=23.92812698) – This parameter is considered if density_algo is PAk, it is ignored otherwise. Set the level of confidence in the PAk density estimator. The default value corresponds to a p-value of \(10^{-6}\) for a \(\chiˆ2\) distribution with one degree of freedom.

  • dim (int, default = None) – Intrinsic dimensionality of the sample. If dim is provided, the following parameters are ignored: dim_algo, blockAn, block_ratio, frac.

  • dim_algo (string, or callable, default="twoNN") – Method for intrinsic dimensionality calculation. If dim_algo is auto, dim is assumed to be equal to n_samples. If dim_algo is a string, it must be one of the options allowed by VALID_DIM = [auto, twoNN].

  • blockAn (bool, default=True) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. If blockAn is True the algorithm perform a block analysis that allows discriminating the relevant dimensions as a function of the block size. This allows to study the stability of the estimation with respect to changes in the neighborhood size, which is crucial for ID estimations when the data lie on a manifold perturbed by a high-dimensional noise.

  • block_ratio (int, default=5) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. Set the minimum size of the blocks as n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

  • frac (float, default=1) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. Define the fraction of points in the data set used for ID calculation. By default the full data set is used.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

labels_

The clustering labels assigned to each point in the data set.

Type

array [Nclus]

halos_

The clustering labels assigned to each point in the data set. Points identified as halos have label equal to zero.

Type

array [Nclus]

topography_

Let be Nclus the number of clusters, the topography consists in a Nclus × Nclus symmetric matrix, in which the diagonal entries are the heights of the peaks and the off-diagonal entries are the heights of the saddle points.

Type

array [Nclus, Nclus]

nn_distances_

Distances to the k_max neighbors of each points. The point itself is included in the array.

Type

array [n_samples, k_max+1]

nn_indices_

Indices of the k_max neighbors of each points. The point itself is included in the array.

Type

array [n_samples, k_max+1]

k_hat_

The optimal number of neighbors for which the condition of constant density holds.

Type

array [n_samples]

centers_

The clustering labels assigned to each point in the data set.

Type

array [Nclus]

dim_

Intrinsic dimensionality of the sample. If dim is not provided, dim_ is set to the number of features in the input file.

Type

int,

k_max_

The maximum number of nearest-neighbors considered by the procedure that returns the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If the number of points in the sample N is less than the default value, k_max_ will be set automatically to the value N/2.

Type

int

densities_

If not provided by the parameter densities, it is computed by using the PAk density estimator.

Type

array [n_samples]

err_densities_

The uncertainty in the density estimation. If not provided by the parameter densities, it is computed by using the PAk density estimator.

Type

array [n_samples]

Example

References

  1. d’Errico, E. Facco, A. Laio, A. Rodriguez, Information Sciences, Volume 560, June 2021, 476-492, https://www.sciencedirect.com/science/article/pii/S0020025521000116?dgcid=author (also available on arXiv: https://arxiv.org/abs/1802.10549)

fit(X, y=None)[source]

Fit the DPA clustering on the data.

Parameters
  • X (array [n_samples, n_samples] if metric == “precomputed”, or,) – [n_samples, n_features] otherwise The input samples. Similarities / affinities between instances if affinity='precomputed'.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Returns self.

Return type

object

fit_predict(X, y=None)[source]

Perform DPA clustering from features or distance matrix, and return cluster labels. :param X: Training instances to cluster, or distances between instances if

metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Parameters

y (Ignored) – Not used, present here for API consistency by convention.

Returns

labels – Cluster labels. Noisy samples are given the label -1.

Return type

ndarray, shape (n_samples,)

get_computed_params()[source]
get_dendrogram()[source]
Pipeline.DPA._DensityPeakAdvanced(densities, err_densities, k_hat, distances, indices, Z)[source]

Main function implementing the Density Peak Advanced clustering algorithm:

  • Automatic detection of cluster centers

  • Point assignament to clusters in order of decreasing g

  • Topography reconstruction: search of saddle points and cluster merging

Parameters
  • densities (array [n_samples]) – The logarithm of the density at each point.

  • err_densities (array [n_samples]) – The uncertainty in the density estimation, obtained by computing the inverse of the Fisher information matrix.

  • k_hat (array [n_samples]) – The optimal number of neighbors for which the condition of constant density holds.

  • distances (array [n_samples, k_max+1]) – Distances to the k_max neighbors of each points. The point itself is included in the array.

  • indices (array [n_samples, k_max+1]) – Indices of the k_max neighbors of each points. The point itself is included in the array.

  • Z (float, default = 1) – The number of standard deviations, which fixes the level of statistical confidence at which one decides to consider a cluster meaningful.

Pipeline.DPA.labels

The clustering labels assigned to each point in the data set.

Type

array [Nclus]

Pipeline.DPA.halos

The clustering labels assigned to each point in the data set. Points identified as halos have clustering lable equal to -1.

Type

array [Nclus]

Pipeline.DPA.topography

Let be Nclus the number of clusters, the topography consists in a Nclus × Nclus symmetric matrix, in which the diagonal entries are the heights of the peaks and the off-diagonal entries are the heights of the saddle points.

Type

array [Nclus, Nclus]

Pipeline.DPA.centers

The list of points identified as the centers of the Nclus statistically significant clusters.

Type

array [Nclus]

Pipeline.PAk module

class Pipeline.PAk.PointAdaptive_kNN(k_max=1000, D_thr=23.92812698, metric='euclidean', dim_algo='auto', nn_distances=None, nn_indices=None, blockAn=True, block_ratio=20, frac=1, dim=None, n_jobs=None)[source]

Bases: sklearn.base.BaseEstimator

Class definition for the Pointwise Adaptive k-NN density estimator.

Parameters
  • k_max (int, default=1000) – The maximum number of nearest-neighbors considered by the procedure that returns the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If the number of points in the sample N is less than the default value, k_max will be set automatically to the value N/2.

  • D_thr (float, default=23.92812698) – Set the level of confidence. The default value corresponds to a p-value of \(10^{-6}\) for a \(\chiˆ2\) distribution with one degree of freedom.

  • metric (string, or callable) – The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean,cosine]. If metric is precomputed, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. Default is euclidean.

  • dim_algo (string, or callable) – Method for intrinsic dimensionality calculation. If dim_algo is auto, dim is assumed to be equal to n_samples. If dim_algo is a string, it must be one of the options allowed by VALID_DIM = [auto, twoNN].

  • nn_distances (array [n_samples, k_max+1]) – Distances to the k_max neighbors of each points.

  • nn_indices (array [n_samples, k_max+1]) – Indices of the k_max neighbors of each points.

  • blockAn (bool, default=True) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. If blockAn is True the algorithm perform a block analysis that allows discriminating the relevant dimensions as a function of the block size. This allows to study the stability of the estimation with respect to changes in the neighborhood size, which is crucial for ID estimations when the data lie on a manifold perturbed by a high-dimensional noise.

  • block_ratio (int, default=20) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. Set the minimum size of the blocks as n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

  • frac (float, default=1) – This parameter is considered if dim_algo is twoNN, it is ignored otherwise. Define the fraction of points in the data set used for ID calculation. By default the full data set is used.

  • dim (int) – Intrinsic dimensionality of the sample. If dim is provided, dim_algo is ignored.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

dim_

Intrinsic dimensionality of the sample. If dim is not provided, dim_ is set to the number of features in the input file.

Type

int,

k_max_

The maximum number of nearest-neighbors considered by the procedure that returns the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If the number of points in the sample N is less than the default value, k_max_ will be set automatically to the value N/2.

Type

int

distances_

Distances to the k_max neighbors of each points.

Type

array [n_samples, k_max+1]

indices_

Indices of the k_max neighbors of each points.

Type

array [n_samples, k_max+1]

densities_

The logarithm of the density at each point.

Type

array [n_samples]

err_densities_

The uncertainty in the density estimation, obtained by computing the inverse of the Fisher information matrix.

Type

array [n_samples]

k_hat_

The optimal number of neighbors for which the condition of constant density holds.

Type

array [n_samples]

dc_

The radius of the optimal neighborhood for each point.

Type

array [n_sample]

Examples

>>> from DPA import PAk
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> est = PAk.PointAdaptive_kNN().fit(X)
>>> est.distances_
array([[0.        , 1.        , 1.        ],
       [0.        , 1.        , 1.41421356],
       [0.        , 1.        , 1.41421356],
       [0.        , 1.41421356, 2.23606798],
       [0.        , 1.        , 2.23606798],
       [0.        , 1.        , 1.41421356]])
>>> est.indices_
array([[0, 1, 2],
       [1, 0, 2],
       [2, 0, 1],
       [3, 5, 4],
       [4, 5, 3],
       [5, 4, 3]])

References

  1. Rodriguez, M. d’Errico, E. Facco and A. Laio, Computing the free energy without collective variables. J. chemical theory computation 14, 1206–1215 (2018).

fit(X, y=None)[source]

Fit the PAk estimator on the data.

Parameters
  • X (array [n_samples, n_samples] if metric == precomputed, or,) – [n_samples, n_features] otherwise The input samples.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Returns self.

Return type

object

Pipeline.PAk._PointAdaptive_kNN(distances, indices, k_max=1000, D_thr=23.92812698, dim=None)[source]

Main function implementing the Pointwise Adaptive k-NN density estimator.

Parameters
  • distances (array [n_samples, k_max+1]) – Distances to the k_max neighbors of each points. The point is included.

  • indices (array [n_samples, k_max+1]) – Indices of the k_max neighbors of each points. The point is included.

  • k_max (int, default=1000) – The maximum number of nearest-neighbors considered by the procedure that returns the largest number of neighbors k_hat for which the condition of constant density holds, within a given level of confidence. If the number of points in the sample N is less than the default value, k_max will be set automatically to the value N/2.

  • D_thr (float, default=23.92812698) – Set the level of confidence. The default value corresponds to a p-value of \(10^{-6}\) for a \(\chiˆ2\) distribution with one degree of freedom.

  • dim (int) – Intrinsic dimensionality of the sample.

Pipeline.PAk.densities

The logarithm of the density at each point.

Type

array [n_samples]

Pipeline.PAk.err_densities

The uncertainty in the density estimation, obtained by computing the inverse of the Fisher information matrix.

Type

array [n_samples]

Pipeline.PAk.k_hat

The optimal number of neighbors for which the condition of constant density holds.

Type

array [n_samples]

Pipeline.PAk.dc

The radius of the optimal neighborhood for each point.

Type

array [n_sample]

Pipeline.twoNN module

Pipeline.twoNN._twoNearestNeighbors(distances, blockAn=True, block_ratio=20, frac=1)[source]

Main function for the TWO-NN estimator.

Parameters
  • distances (array [n_samples, k_max]) – Distances to the k_max neighbors of each points.

  • blockAn (bool, default=True) – If blockAn is True the algorithm perform a block analysis that allows discriminating the relevant dimensions as a function of the block size. This allows to study the stability of the estimation with respect to changes in the neighborhood size, which is crucial for ID estimations when the data lie on a manifold perturbed by a high-dimensional noise.

  • block_ratio (int, default=20) – Set the minimum size of the blocks as n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

  • frac (float, default=1) – Define the fraction of points in the data set used for ID calculation. By default the full data set is used.

class Pipeline.twoNN.twoNearestNeighbors(metric='euclidean', blockAn=True, block_ratio=20, frac=1, n_jobs=None, affinity='precomputed')[source]

Bases: sklearn.base.BaseEstimator

Class definition for TWO-NN: Intrinsic dimension estimator by a minimal neighborhood information.

The TWO-NN estimator uses only the distances to the first two nearest neighbors of each point.

Parameters
  • metric (string, or callable) – The distance metric to use. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in VALID_METRIC = [precomputed, euclidean,cosine]. If metric is precomputed, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. Default is euclidean.

  • blockAn (bool, default=True) – If blockAn is True the algorithm perform a block analysis that allows discriminating the relevant dimensions as a function of the block size. This allows to study the stability of the estimation with respect to changes in the neighborhood size, which is crucial for ID estimations when the data lie on a manifold perturbed by a high-dimensional noise.

  • block_ratio (int, default=20) – Set the minimum size of the blocks as n_samples/block_ratio. If blockAn=False, block_ratio is ignored.

  • frac (float, default=1) – Define the fraction of points in the data set used for ID calculation. By default the full data set is used.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • affinity (string or callable, default 'precomputed') –

    How to construct the affinity matrix.
    • nearest_neighbors : construct the affinity matrix by computing a graph of nearest neighbors.

    • rbf : construct the affinity matrix using a radial basis function (RBF) kernel.

    • precomputed : interpret X as a precomputed affinity matrix.

    • precomputed_nearest_neighbors : interpret X as a sparse graph of precomputed nearest neighbors, and constructs the affinity matrix by selecting the n_neighbors nearest neighbors.

    • one of the kernels supported by pairwise_kernels().

dim_

The intrinsic dimensionality

Type

int

References

  1. Facco, M. d’Errico, A. Rodriguez and A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci. reports 7, 12140 (2017)

fit(X, y=None)[source]

A reference implementation of a fitting function.

Parameters
  • X (array [n_samples, n_samples] if metric == precomputed, or,) – [n_samples, n_features] otherwise The input samples. Similarities / affinities between instances if affinity='precomputed'.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Returns self.

Return type

object