Abstract: With the rapid advancement of deep learning and computer vision, grayscale image colorization has evolved from traditional methods based on handcrafted features to data-driven deep neural networks. However, existing evaluation schemes for grayscale image colorization models face two challenges. First, because the colorization task is highly ill-posed (a single grayscale image admits many plausible colorizations), traditional quantitative metrics such as PSNR, SSIM, and FID cannot effectively quantify model performance. Second, qualitative analysis through large-scale subjective experiments is time-consuming, laborious, and often infeasible. To address these issues, a new evaluation method for grayscale image colorization models based on hard sample mining is proposed. The method efficiently identifies representative samples for model comparison through multi-dimensional evaluation (covering image quality, aesthetic expression, and color difference), and then uses a controlled small-scale subjective experiment on these samples to compare different models reliably, thereby revealing their strengths and weaknesses. Experimental results show that the proposed method finds hard samples efficiently and accurately and reveals the strengths and weaknesses of the models while drastically reducing the scale of subjective experiments, providing a new paradigm for evaluating grayscale image colorization models and indicating directions for model optimization.