DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It defines a cluster as a maximal set of density-connected points. The algorithm groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in spatial datasets that contain noise. You can use the DBSCAN component to build clustering models. This topic describes how to configure the DBSCAN component.
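The idea of core points and density-connected points can be illustrated with a minimal, single-machine sketch. This is not the component's implementation. It assumes Euclidean distance, counts a point as part of its own neighborhood, and uses the hypothetical names `dbscan`, `region_query`, and `NOISE`.

```python
import math

NOISE = -1  # label used in this sketch for points that belong to no cluster


def region_query(points, i, epsilon):
    """Return the indices of all points within epsilon of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]


def dbscan(points, epsilon, min_points):
    """Label each point with a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, epsilon)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, epsilon)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, epsilon=1.5, min_points=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, and the isolated point at (50, 50) is labeled as noise.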
Functional limitations
This component is supported only in Designer.
The supported computing engines are MaxCompute and Flink.
Configure component parameters in the UI
In Designer, you can configure the component parameters in the UI.
Tab | Parameter | Description |
Fields setting | ID column name | The name of the ID column. |
Fields setting | Vector column name | The name of the vector column. |
Parameters setting | Neighborhood distance threshold | If point A is in the neighborhood of point B, the distance between them does not exceed this threshold. For more information, see Appendix 2: How to configure parameters. |
Parameters setting | Threshold for the number of samples in a neighborhood | The minimum number of points required in the neighborhood of a point for it to be considered a core point. For more information, see Appendix 2: How to configure parameters. |
Parameters setting | Prediction result column name | The name of the prediction result column. |
Parameters setting | Distance measure | The type of distance measure used for clustering. The default is EUCLIDEAN. Valid values: |
Execution tuning | Number of workers | Used with the Memory per worker parameter. This parameter must be a positive integer from 1 to 9999. For more information, see Appendix 1: How to estimate resource usage. |
Execution tuning | Memory per worker (MB) | The value must be in the range of 1024 MB to 64 × 1024 MB. For more information, see Appendix 1: How to estimate resource usage. |
Appendix 1: How to estimate resource usage
Use the following information to estimate resource usage.
How do I estimate the memory for each worker?
To calculate the required memory for each worker, multiply the input data size by 15.
For example, if the input data is 1 GB, configure the memory for each worker to 15 GB.
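The rule of thumb above can be written as a small helper. This is only a sketch: the function name `estimate_worker_memory_mb` is hypothetical, and the choice to raise an error when the estimate exceeds the Memory per worker range (1024 MB to 64 × 1024 MB) is an assumption of this example, not documented component behavior.

```python
import math

MIN_WORKER_MEMORY_MB = 1024       # lower bound of the Memory per worker parameter
MAX_WORKER_MEMORY_MB = 64 * 1024  # upper bound of the Memory per worker parameter
MEMORY_FACTOR = 15                # rule of thumb: input data size multiplied by 15


def estimate_worker_memory_mb(input_size_mb):
    """Estimate the Memory per worker setting (MB) from the input size (MB)."""
    estimate = math.ceil(input_size_mb * MEMORY_FACTOR)
    if estimate > MAX_WORKER_MEMORY_MB:
        # Assumption of this sketch: treat an out-of-range estimate as an error.
        raise ValueError(
            f"estimated {estimate} MB exceeds the {MAX_WORKER_MEMORY_MB} MB limit"
        )
    return max(estimate, MIN_WORKER_MEMORY_MB)


print(estimate_worker_memory_mb(1024))  # 1 GB of input -> 15360 MB (15 GB)
```

Small inputs are rounded up to the 1024 MB minimum, because the parameter does not accept lower values.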
How do I estimate the number of workers?
Because of communication overhead, the speed of a distributed training task first increases and then decreases as the number of workers grows. If you observe that the training task slows down after you add more workers, stop adding workers.
What is the maximum data volume supported by the algorithm?
The algorithm supports fewer than 1 million data points and fewer than 200 dimensions.
Note: If the data volume exceeds this range, group the data and then run the DBSCAN algorithm on each group separately.
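One way to follow the note is to partition the rows before clustering. The sketch below is an illustration under assumptions: the source does not say how to group the data, so the grouping attribute, the row layout `(row_id, vector, attribute)`, and the names `within_limits` and `split_into_groups` are all hypothetical.

```python
from collections import defaultdict

MAX_POINTS = 1_000_000  # limit stated above: fewer than 1 million data points
MAX_DIMENSIONS = 200    # limit stated above: fewer than 200 dimensions


def within_limits(vectors):
    """Return True if a group of vectors is inside the stated limits."""
    return len(vectors) < MAX_POINTS and all(len(v) < MAX_DIMENSIONS for v in vectors)


def split_into_groups(rows, group_key):
    """Partition rows by a caller-supplied key function."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return dict(groups)


rows = [
    (1, (0.0, 0.1), "north"),
    (2, (0.2, 0.0), "north"),
    (3, (9.0, 9.1), "south"),
]
groups = split_into_groups(rows, group_key=lambda row: row[2])
for name, members in groups.items():
    vectors = [vector for _, vector, _ in members]
    # Each group would then be clustered in its own DBSCAN run.
    print(name, len(members), within_limits(vectors))
```

Because each group is clustered independently, no cluster can span two groups, and cluster IDs from different runs are not comparable.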
Why does the cluster centroid have an ID of 2147483648?
An ID of 2147483648 indicates that the data point is an outlier and does not belong to any cluster.
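When post-processing the prediction results, outliers can be separated from clustered points by checking for this ID. The following sketch assumes the results have been loaded as `(row_id, cluster_id)` pairs. That layout and the name `split_outliers` are hypothetical.

```python
OUTLIER_CLUSTER_ID = 2147483648  # ID quoted in the question above


def split_outliers(predictions):
    """Split (row_id, cluster_id) pairs into clustered rows and outliers."""
    clustered, outliers = [], []
    for row_id, cluster_id in predictions:
        if cluster_id == OUTLIER_CLUSTER_ID:
            outliers.append((row_id, cluster_id))
        else:
            clustered.append((row_id, cluster_id))
    return clustered, outliers


predictions = [("a", 0), ("b", 0), ("c", 2147483648), ("d", 1)]
clustered, outliers = split_outliers(predictions)
print(outliers)  # [('c', 2147483648)]
```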
Appendix 2: How to configure parameters
The two most common parameters for the DBSCAN component are the minimum number of samples in a neighborhood (minPoints) and the neighborhood distance threshold (epsilon). Configure these parameters as follows:
If you observe too many clusters and want to reduce the number, increase the minPoints value first, and then decrease the epsilon value.
If you observe too few clusters and want to increase the number, decrease the minPoints value first, and then increase the epsilon value.