Abstract: A fundamental and difficult problem in cluster analysis is determining the "true" number of clusters in a dataset. The common trial-and-error approach generally depends on a particular clustering algorithm and is inefficient on large datasets. In this paper, a hierarchical method is proposed that avoids repeatedly clustering large datasets. The method first obtains the CFs (clustering features) by scanning the dataset and agglomeratively generates hierarchical partitions of the dataset; a curve of clustering quality with respect to the varying partitions is then constructed incrementally. Finally, the partition corresponding to the extremum of the curve is used to estimate the number of clusters. A new validity index is also presented to quantify clustering quality. The index is independent of the clustering algorithm and emphasizes the geometric features of clusters, and it efficiently handles noisy data and arbitrarily shaped clusters. Experimental results on both real-world and synthetic datasets demonstrate that the new method outperforms recently published approaches while significantly improving efficiency.
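To make the pipeline outlined above concrete, the following is a minimal sketch under several assumptions that are not taken from the paper. It assumes a BIRCH-style clustering feature (count, linear sum, sum of squared norms), builds the initial CFs by bucketing points into grid cells during a single scan, merges CFs with a Ward-style cost, and uses the Calinski-Harabasz index as a stand-in for the quality curve. The paper's own validity index is not reproduced here, and all function names (`cf_of`, `cf_merge`, `scan_to_cfs`, `estimate_k`, and so on) are hypothetical.

```python
import random
from itertools import combinations


def cf_of(point):
    """CF of one point: (count, linear sum, sum of squared norms). Assumed BIRCH-style."""
    return (1, list(point), sum(x * x for x in point))


def cf_merge(a, b):
    """CFs are additive, so merging two clusters never revisits raw points."""
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def cf_sse(cf):
    """Sum of squared distances to the centroid, computed from the CF alone."""
    n, ls, ss = cf
    return ss - sum(x * x for x in ls) / n


def scan_to_cfs(points, cell):
    """Single scan: accumulate one CF per occupied grid cell (assumed scheme)."""
    cells = {}
    for p in points:
        key = tuple(int(x // cell) for x in p)
        cells[key] = cf_merge(cells[key], cf_of(p)) if key in cells else cf_of(p)
    return list(cells.values())


def merge_cost(a, b):
    """Ward-style cost (assumed): increase in SSE caused by merging a and b."""
    return cf_sse(cf_merge(a, b)) - cf_sse(a) - cf_sse(b)


def quality(cfs, total_sse, n):
    """Calinski-Harabasz index, a stand-in for the paper's validity index.

    Assumes the within-cluster SSE is strictly positive.
    """
    k = len(cfs)
    within = sum(cf_sse(c) for c in cfs)
    between = total_sse - within
    return (between / (k - 1)) / (within / (n - k))


def estimate_k(points, cell=1.0, k_max=10):
    """Merge CFs bottom-up, record quality per partition, return the best k."""
    cfs = scan_to_cfs(points, cell)
    n = len(points)
    total = cfs[0]
    for c in cfs[1:]:
        total = cf_merge(total, c)
    total_sse = cf_sse(total)

    curve = {}
    while True:
        k = len(cfs)
        if 2 <= k <= k_max:
            curve[k] = quality(cfs, total_sse, n)
        if k <= 2:
            break
        # Merge the pair of clusters with the smallest merge cost.
        i, j = min(combinations(range(k), 2),
                   key=lambda ij: merge_cost(cfs[ij[0]], cfs[ij[1]]))
        merged = cf_merge(cfs[i], cfs[j])
        cfs = [c for t, c in enumerate(cfs) if t not in (i, j)] + [merged]

    if not curve:
        raise ValueError("need at least two initial CFs to build a curve")
    # For this stand-in index the extremum of interest is the maximum.
    return max(curve, key=curve.get), curve


if __name__ == "__main__":
    random.seed(0)
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    data = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(100)]
    best_k, _ = estimate_k(data)
    print("estimated number of clusters:", best_k)
```

In this sketch the raw data is read only once, in `scan_to_cfs`. Every later merge and every point on the quality curve is computed from the additive CFs, which is the property that lets a hierarchical scheme of this kind avoid re-clustering the full dataset for each candidate number of clusters. The pairwise search in `estimate_k` is cubic in the number of initial CFs, so it is only meant for a small number of grid cells.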