Abstract: Mainstream scene text detection methods often rely on complex networks with many layers to improve detection accuracy, which incurs high computational costs and large storage requirements, making them difficult to deploy on embedded devices with limited computing resources. Knowledge distillation helps train lightweight student networks by introducing soft-target information from a teacher network, thereby achieving model compression. However, existing knowledge distillation methods are mostly designed for image classification and extract soft probability distributions from the teacher network as knowledge. The amount of information such distributions carry is strongly correlated with the number of categories, so they convey too little information when applied directly to the binary classification task in text detection. To address this problem, this study introduces information entropy into knowledge distillation for scene text detection and proposes a distillation method based on mask entropy transfer (MaskET). MaskET combines information entropy with conventional knowledge distillation to increase the amount of information transferred to the student network. Moreover, to eliminate interference from background regions in images, MaskET extracts knowledge only within the text area by applying mask operations. Experiments on six public benchmark datasets, namely ICDAR 2013, ICDAR 2015, TD500, TD-TR, Total-Text and CASIA-10K, show that MaskET outperforms both the baseline model and other knowledge distillation methods. For example, MaskET improves the F1 score of MobileNetV3-based DBNet from 65.3% to 67.2% on the CASIA-10K dataset.
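To make the idea concrete, the following is a minimal sketch (not the paper's actual implementation) of an entropy-weighted, mask-restricted distillation loss for a binary text/non-text probability map. The function name, the use of binary cross-entropy as the distillation term, and the exact weighting scheme are all illustrative assumptions consistent with the abstract's description, in which the teacher's pixel-wise entropy supplies extra information and a text-region mask suppresses background pixels.

```python
import numpy as np

def mask_entropy_distill_loss(teacher_prob, student_prob, text_mask, eps=1e-7):
    """Illustrative MaskET-style loss (hypothetical formulation).

    teacher_prob, student_prob: arrays of per-pixel text probabilities in [0, 1].
    text_mask: binary array, 1 inside annotated text regions, 0 elsewhere.
    """
    # Pixel-wise binary entropy of the teacher map; high-entropy pixels
    # carry more information than confident (near 0/1) predictions.
    t = np.clip(teacher_prob, eps, 1 - eps)
    entropy = -(t * np.log(t) + (1 - t) * np.log(1 - t))

    # Binary cross-entropy between teacher (soft target) and student maps.
    s = np.clip(student_prob, eps, 1 - eps)
    bce = -(t * np.log(s) + (1 - t) * np.log(1 - s))

    # Mask out background pixels and weight the remainder by teacher entropy,
    # then normalize by the number of text-region pixels.
    weighted = entropy * bce * text_mask
    return weighted.sum() / (text_mask.sum() + eps)
```

In this sketch, background pixels contribute nothing to the loss (the mask zeroes them out), and ambiguous teacher pixels are emphasized over confident ones, which is one plausible way to realize the "mask entropy transfer" idea described above.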