DIPS: Efficient Multi-Criteria Cantonese Word Segmentation

Abstract

We present DIPS, a novel approach to Cantonese word segmentation that addresses the challenges of balancing model size, inference speed, and accuracy while accommodating diverse segmentation standards and supporting named entity recognition. Our method combines fine-tuning, knowledge distillation, structured pruning, and quantization to create a compact and efficient model. DIPS introduces a new segmentation scheme that captures nuanced word boundaries, enabling flexible multi-criteria segmentation and entity identification. Experimental results show that DIPS achieves comparable performance to ELECTRA Small on standard benchmarks while being 17 times smaller (3 MB) and 3 times faster (0.13 ms/token on CPU). DIPS offers a practical solution for Cantonese NLP tasks, particularly in resource-constrained environments and real-time applications. The model is available as open-source libraries published on PyPI and NPM. Source code and pretrained models are available on GitHub.

Introduction

Chinese word segmentation faces two major challenges: balancing model size, inference speed, and accuracy, and accommodating diverse segmentation standards.

To illustrate these challenges, let's examine the popular HanLP toolkit, which offers 3 types of segmentation models:

  1. Accurate but heavy transformer models (50MB to 400MB)
  2. Less accurate but faster convolutional models (around 8MB)
  3. Lightest and fastest dictionary segmentation (typically less than 1MB), but with the lowest accuracy

While transformer models are the default in HanLP due to their versatility, their size and computational requirements hinder widespread adoption. For instance, HanLP's demos for these models run on servers, and developers must request an API key for sustained access. Consequently, lighter-weight approaches such as Jieba's Hidden Markov Model remain the most popular: at around 5 MB, Jieba owes its success to outperforming dictionary segmentation while maintaining a manageable footprint.

The challenge is further compounded by diverse segmentation standards, which fragment the limited training data and necessitate multiple models for different standards. HanLP trains a model for each corpus (e.g., CTB9, MSR, PKU) as they feature distinct segmentation standards. Recent work by He et al. (2018) and Chou et al. (2023) trains a single model for all standards, approaching multi-criteria Chinese word segmentation as a multi-task learning problem. However, users often struggle to choose the most suitable standard, cannot mix and match different standards, and lack insight into the differences between them.

These challenges are even more pronounced in Cantonese NLP, where training data is much more limited. To address these issues, we propose DIPS, a multi-criteria Cantonese word segmentation toolkit that is slim, fast, and accurate. DIPS achieves performance comparable to ELECTRA Small while maintaining a size comparable to Jieba (only ~3 MB). Thanks to a combination of model compression techniques, its inference is more than 3x faster than ELECTRA Small, making DIPS a practical segmentation solution on the web and mobile.

Moreover, DIPS adopts a multi-criteria segmentation format recently proposed by Lam et al. (2024). This format naturally encompasses both coarse and fine-grained segmentation levels while also marking word category boundaries, morpheme boundaries, and named entities. As a result, DIPS can accommodate a variety of segmentation needs in natural language processing, information retrieval, and language education.

Literature Review

Chinese and Cantonese Word Segmentation

Toasty News conducted a comprehensive survey of existing Cantonese word segmentation tools, comparing their performance across various Cantonese and Chinese datasets. For our analysis, we'll focus on two particularly insightful datasets in the Universal Dependencies format for Hong Kong Cantonese and Written Chinese (Wong et al., 2017). These datasets are especially valuable as they showcase both languages in formal and informal contexts, providing a balanced representation. Moreover, their parallel nature allows for a direct comparison of word segmentation capabilities between Cantonese and Chinese, offering a clear picture of each model's cross-lingual performance.

Regarding training datasets, Chinese boasts a rich selection of segmentation corpora. Among these, four datasets from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005) are particularly noteworthy. These datasets, sourced from Academia Sinica (AS), City University of Hong Kong (CityU), Peking University (PKU), and Microsoft Research (MSR), serve as standard training datasets for Chinese word segmentation research and development. The landscape for Cantonese is notably sparse by comparison: the Hong Kong Cantonese Corpus (HKCanCor) stands out as the sole significant open-source Cantonese corpus with word segmentation. The largest Chinese dataset, AS, contains 4,904,120 tokens and the smallest, PKU, contains 991,815 tokens, whereas HKCanCor contains only 153,654 tokens, roughly 6x smaller than PKU and 30x smaller than AS. Due to the relative abundance of Chinese data, prior work usually trains jointly on both languages to increase the training set size.

Table 1. F1 score of various segmentation tools on the UD Yue HK and UD Zh HK datasets

| Tool                          | UD Yue HK | UD Zh HK |
|-------------------------------|-----------|----------|
| Toasty News ELECTRA HK Small  | 94.68%    | 92.77%   |
| Toasty News ELECTRA HK Base   | 94.62%    | 93.30%   |
| Toasty News ELECTRA HKT Base  | 94.04%    | 93.27%   |
| Toasty News ELECTRA HKT Small | 93.89%    | 93.14%   |
| CKIP BERT Base                | 89.41%    | 92.70%   |
| cantoseg                      | 86%       | 84%      |
| PyCantonese                   | 86%       | 84%      |
| CKIP BERT Tiny                | 85.02%    | 92.07%   |
| pkuseg                        | 83%       | 92%      |
| jieba                         | 82%       | 84%      |

Table 1 shows five groups of models:

  1. Transformer models trained on only Cantonese segmentation datasets (fine-tuned on HKCanCor)
  2. Transformer models trained on both Cantonese and Chinese segmentation datasets (fine-tuned on HKCanCor, AS, and CityU)
  3. Transformer models trained on only Chinese segmentation datasets
  4. HMM models
  5. Dictionary segmentation models

Among the models tested, transformers trained exclusively on Cantonese segmentation datasets (group 1) exhibit the highest performance on UD Yue HK. They are closely followed by models trained on mixed Cantonese and Chinese datasets (group 2). In contrast, transformers trained solely on Chinese segmentation datasets (group 3) show a marked decrease in performance when applied to Cantonese. Cantoseg, an HMM model, performs on par with dictionary segmentation, owing to its incorporation of dictionary data from HKCanCor. It's worth noting that transformer models trained on Cantonese data demonstrate good generalization to Chinese, likely due to their pretraining on substantial amounts of Chinese text.

A closer look at segmentation errors reveals that some are false alarms caused by differing standards between datasets. For instance, HKCanCor splits proper names (e.g., 黎 lai4 + 明 ming4, "Leon Lai"), while UD does not. These differences, as detailed by Lam et al. (2024), complicate accurate model evaluation. Furthermore, existing multi-criteria methods prove ineffective in this context: with UD Yue reserved for testing, HKCanCor is the sole Cantonese dataset available for training.

Efficient Transformer Inference

While transformer models demonstrate superior performance, their substantial memory and computational requirements impede widespread adoption. To address this challenge, researchers have developed various techniques for model compression and accelerated inference. Three popular approaches are knowledge distillation, quantization, and pruning.

Sanh et al. (2020) first applied knowledge distillation, in which a smaller student model is trained to emulate a larger teacher, to BERT in their work on DistilBERT. This technique reduced the size of a BERT Base model by 40% while retaining 97% of its language understanding capabilities and achieving 60% faster inference.
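
To make the general recipe concrete, here is a minimal sketch of a standard logit-distillation loss (for illustration only; as described later, DIPS itself is distilled with hard labels rather than logits):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a hard-label loss with a soft-target loss from the teacher.

    student_logits, teacher_logits: (batch, num_classes); labels: (batch,).
    T softens both distributions; alpha balances the two terms.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale the soft term after temperature softening
    return alpha * hard + (1.0 - alpha) * soft
```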

Quantization, another effective technique, reduces the precision of model parameters. It can be broadly categorized into two types: Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) (Hu et al., 2022). QAT methods like Q8BERT (Zafrir et al., 2019) quantize BERT's weights and activations to 8 bits by introducing quantization error during fine-tuning. TernaryBERT (Zhang et al., 2020) uses ternary weights and 8-bit activations, overcoming performance degradation through knowledge distillation. BinaryBERT (Bai et al., 2020) further refines TernaryBERT by binarizing weights and quantizing activations to 4 bits, though this process is time- and resource-intensive.

PTQ methods generally do not require extra fine-tuning or retraining to compensate for quantization errors. Hu et al. (2022) evaluated three PTQ approaches on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). They found that OCS could quantize BERT-Base and BERT-Large to 3 bits while retaining 98% and 96% of their performance on the GLUE benchmark, respectively. Popular machine learning libraries have also developed their own quantization methods. For instance, GGML, a tensor library for transformer models, supports at least 17 post-training quantization methods with various size and performance trade-offs. The Q4_0 method, a 4-bit round-to-nearest quantization, is generally considered the most aggressive quantization type that still offers acceptable performance. It can compress a 7B transformer model from 26 GB in F32 full precision to 3.5 GB at the cost of a 0.2499 perplexity increase.
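
As a rough illustration of what a 4-bit round-to-nearest scheme does, the sketch below quantizes weights blockwise with one scale per block; it follows the spirit of Q4_0 but is not GGML's actual data layout or code:

```python
import numpy as np

def quantize_q4_like(weights: np.ndarray, block_size: int = 32):
    """Blockwise 4-bit round-to-nearest quantization (Q4_0-style sketch).

    Each block stores one scale plus a 4-bit integer in [-8, 7] per weight.
    Assumes the number of weights is a multiple of block_size.
    """
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per block
    scale[scale == 0] = 1.0                             # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape):
    """Recover approximate float weights from 4-bit codes and block scales."""
    return (q.astype(np.float32) * scale).reshape(shape)
```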

Pruning, another compression technique, was pioneered for modern neural networks by Han et al. (2015). Their method involved removing unimportant connections and retraining the network to fine-tune the remaining connections' weights. This approach reduced AlexNet's parameters by a factor of 9 on the ImageNet dataset without sacrificing accuracy. However, unstructured sparsity patterns are challenging to support efficiently on modern CPUs and GPUs. Consequently, structured pruning, which removes blocks of weights at once, has gained popularity (Blalock et al., 2020). For transformer models like BERT, one straightforward structured pruning method is dropping entire transformer layers. Sajjad et al. (2023) systematically evaluated different layer-dropping strategies on several BERT models and successfully pruned up to 40% of the model while maintaining up to 98% of the original performance by simply removing the top half of the transformer layers.
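
As a concrete sketch of this layer-dropping strategy, assuming the Hugging Face transformers implementation of ELECTRA (not the exact script used in this work), removing the top encoder layers takes only a few lines:

```python
import torch.nn as nn
from transformers import ElectraForTokenClassification

def drop_top_layers(model: ElectraForTokenClassification, n_drop: int):
    """Structured pruning: remove the top n_drop transformer layers in place."""
    layers = model.electra.encoder.layer                  # nn.ModuleList of encoder blocks
    kept = list(layers[: len(layers) - n_drop])
    model.electra.encoder.layer = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model

# e.g. keep the bottom 6 of 12 layers before (re-)fine-tuning
# model = ElectraForTokenClassification.from_pretrained("...")
# model = drop_top_layers(model, n_drop=6)
```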

Methodology

Our methodology for efficient multi-criteria Cantonese word segmentation involves three key steps: fine-tuning, knowledge distillation combined with structured pruning, and quantization.

First, we approach word segmentation as a token classification task. We fine-tune Cantonese ELECTRA models on a new version of the HKCanCor corpus annotated by Lam et al. (2024). This annotation scheme takes a more nuanced approach to word boundaries, using three separators to mark different types of separation: spaces ( ) for ordinary word boundaries, dashes (-) for morpheme and affix boundaries, and pipes (|) for boundaries within named entities (see the example below).

To accommodate this refined annotation, we extend the traditional BI (Beginning, Intermediate) scheme. Our new DIPS scheme splits the Beginning label into three types according to the separator preceding the character: D (dash), P (pipe), and S (space), while I continues to mark intermediate characters. This allows our models to capture the nuanced segmentation information present in the new annotation.

At inference time, our model offers users a flexible range of segmentation options. Building upon the DIPS scheme, we provide both coarse and fine-grained segmentation levels. Users can choose to split results using only spaces for coarse segmentation or utilize all separator types for more fine-grained results. Furthermore, the fully annotated output remains accessible, enabling advanced applications such as named entity extraction from pipe-separated tokens. This versatility ensures that our model can adapt to various user needs and downstream tasks.

Here's an example to illustrate our DIPS segmentation scheme. Note how the named entity 張先生 (Mr. Zoeng) is separated by a pipe and the affectionate prefix 阿 is attached to the entity via a dash.
Sentence: 阿張先生嗰時好nice㗎
DIPS:     阿-張|先生 嗰-時 好 nice 㗎
Fine:     阿 張 先生 嗰 時 好 nice 㗎
Coarse:   阿張先生 嗰時 好 nice 㗎
Labels:   SDPISDSSIIIS
Gloss:    Mr. Zoeng was so nice at the time.
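
Decoding DIPS labels back into segmentations is mechanical. The sketch below is illustrative only (it is not the pydips API), and for readability it treats the Latin word "nice" as a single unit, whereas the labels above assign one tag per character:

```python
def decode_dips(units, labels, granularity="fine"):
    """Rebuild tokens from per-unit DIPS labels.

    S, D, P start a new token after a space, dash, or pipe respectively;
    I continues the current token. Coarse segmentation splits only on S,
    while fine segmentation splits on S, D, and P.
    """
    split_on = {"S"} if granularity == "coarse" else {"S", "D", "P"}
    tokens = []
    for unit, label in zip(units, labels):
        if not tokens or label in split_on:
            tokens.append(unit)      # start a new token
        else:
            tokens[-1] += unit       # continue the current token
    return tokens

units  = list("阿張先生嗰時好") + ["nice", "㗎"]
labels = list("SDPISDSSS")
print(decode_dips(units, labels, "coarse"))  # ['阿張先生', '嗰時', '好', 'nice', '㗎']
print(decode_dips(units, labels, "fine"))    # ['阿', '張', '先生', '嗰', '時', '好', 'nice', '㗎']
```

The same labels also let us group pipe-separated tokens to recover named entities such as 張先生.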

Second, we combine knowledge distillation with structured pruning to achieve significant model compression while minimizing performance degradation. We distill a fine-tuned ELECTRA Base model into an ELECTRA Small model with half of its layers removed. The number of layers to drop was determined empirically based on the training loss during fine-tuning.

Finally, we apply Q4_0 quantization using GGML to further compress the model and increase inference speed. This step prepares the model for efficient deployment across various platforms, enabling fast CPU and GPU inference.

Evaluation

To evaluate our model's performance, we use the standard token F1 score. To accurately evaluate our model on the UD datasets that only provide space-segmented tokens, we leverage the flexibility of our DIPS scheme to dynamically accommodate the UD standard on a token level during evaluation. This works by first creating a coarse-to-fine token mapping for our segmentation result, converting matched coarse tokens in the UD sentence to our fine tokens, and finally comparing the transformed ground truth with our fine tokens. Here's an example to illustrate the evaluation process:

Sentence: 阿張先生嗰時好nice㗎
Hypothesis:          阿-張|先生 嗰-時 好 nice 㗎
Hypothesis (Fine):   阿 張 先生 嗰 時 好 nice 㗎
Hypothesis (Coarse): 阿張先生 嗰時 好 nice 㗎
Coarse to Fine:      阿張先生 => 阿 張 先生, 嗰時 => 嗰 時
Reference:           阿張先生 嗰 時 好 nice 㗎
Reference (Mapped):  阿 張 先生 嗰 時 好 nice 㗎

Observe how the hypothesis aligns with the reference, despite neither the coarse nor fine versions matching it exactly. This discrepancy arises because the hypothesis combines elements of both segmentation levels. By applying a coarse-to-fine mapping to the reference, we can more accurately evaluate the model's performance without unfairly penalizing it for segmentations that are technically correct but differ in granularity.
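
In code, the mapping and scoring steps look roughly like the following sketch (illustrative only, not our evaluation script); token F1 is computed over character-offset spans:

```python
def map_reference(reference_tokens, coarse_to_fine):
    """Expand reference tokens that exactly match one of our coarse tokens
    into the corresponding fine tokens, leaving everything else untouched."""
    mapped = []
    for tok in reference_tokens:
        mapped.extend(coarse_to_fine.get(tok, [tok]))
    return mapped

def token_f1(hyp_tokens, ref_tokens):
    """Segmentation F1 over character-offset spans."""
    def spans(tokens):
        out, start = set(), 0
        for t in tokens:
            out.add((start, start + len(t)))
            start += len(t)
        return out
    hyp, ref = spans(hyp_tokens), spans(ref_tokens)
    tp = len(hyp & ref)
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

coarse_to_fine = {"阿張先生": ["阿", "張", "先生"], "嗰時": ["嗰", "時"]}
reference = ["阿張先生", "嗰", "時", "好", "nice", "㗎"]
hypothesis_fine = ["阿", "張", "先生", "嗰", "時", "好", "nice", "㗎"]
print(token_f1(hypothesis_fine, map_reference(reference, coarse_to_fine)))  # 1.0
```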

Experiments and Results

We investigate the effect of multiple axes on the performance vs. size tradeoff: model size, the number of layers dropped, distillation, vocabulary truncation, and quantization (see Table 2).

Our experimental process consists of 4 steps:

  1. We first fine-tune pretrained Cantonese ELECTRA Small, Base, and Large models on HKCanCor for token classification for 3 epochs.
  2. Based on performance on the UD datasets, we select ELECTRA Base as the source model for knowledge distillation. For simplicity, we do not use the output logits during distillation and instead just fine-tune ELECTRA Small on parts of the Cantonese Wikipedia segmented by ELECTRA Base along with the ground truth labels for HKCanCor.
  3. We truncate the vocabulary list from 30,000 to 8,287 tokens, keeping all the characters from HKCanCor in addition to the top characters in the Cantonese Wikipedia that account for 99.99% of all characters in the wiki (a sketch follows after this list).
  4. Lastly, we quantize the model with truncated vocabulary using the Q4_0 quantization method to arrive at our final model that's slim, fast, and accurate.
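
Here is a rough sketch of the vocabulary truncation in step 3, assuming a character-level vocabulary and the Hugging Face tokenizer/model APIs; the tokenizer's own vocabulary file would also need to be rebuilt and input ids remapped, which is omitted here:

```python
from collections import Counter
import torch

def select_characters(wiki_text: str, hkcancor_text: str, coverage: float = 0.9999):
    """All HKCanCor characters plus the most frequent Wikipedia characters
    that together cover `coverage` of all character occurrences in the wiki."""
    counts = Counter(wiki_text)
    total = sum(counts.values())
    kept, covered = set(hkcancor_text), 0
    for ch, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(ch)
        covered += n
    return kept

def shrink_embeddings(model, tokenizer, kept_chars):
    """Keep only the embedding rows for the kept characters and special tokens."""
    vocab = tokenizer.get_vocab()
    keep_ids = sorted(
        vocab[t]
        for t in set(kept_chars) | set(tokenizer.all_special_tokens)
        if t in vocab
    )
    old = model.get_input_embeddings().weight.data
    new_emb = torch.nn.Embedding(len(keep_ids), old.size(1))
    new_emb.weight.data.copy_(old[keep_ids])
    model.set_input_embeddings(new_emb)
    model.config.vocab_size = len(keep_ids)
    return keep_ids  # old ids in kept order; input ids must be remapped accordingly
```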

Figure 1 shows the performance vs. size tradeoff for the various configurations. Figure 2 shows performance vs. latency for the same configurations, measured on an M1 Max CPU with 10 cores. Table 2 reports the full statistics on both UD datasets for all configurations tested.

Figure 1. F1 Score on UD Yue vs Model Size
Figure 2. F1 Score on UD Yue vs Latency
Table 2. Performance, Size, and Latency of Various Configurations.
| Model | Layers Dropped | Distilled | Truncated Vocab | Quantization | Size (MB) | Latency (ms/tok) | UD Yue F1 | UD Zh F1 |
|-------|----------------|-----------|-----------------|--------------|-----------|------------------|-----------|----------|
| Large | 0 | No  | No  | None | 1272 | 5.51 | 0.9620 | 0.9571 |
| Base  | 0 | No  | No  | None | 414  | 2.05 | 0.9663 | 0.9533 |
| Small | 0 | No  | No  | None | 51   | 0.40 | 0.9541 | 0.9401 |
| Small | 6 | No  | No  | None | 33   | 0.22 | 0.9470 | 0.9267 |
| Small | 7 | No  | No  | None | 30   | 0.17 | 0.9388 | 0.9147 |
| Small | 8 | No  | No  | None | 27   | 0.14 | 0.9148 | 0.8557 |
| Small | 9 | No  | No  | None | 24   | 0.11 | 0.9021 | 0.8257 |
| Small | 6 | Yes | No  | None | 33   | 0.25 | 0.9616 | 0.9454 |
| Small | 6 | Yes | Yes | None | 23   | 0.22 | 0.9578 | 0.9401 |
| Small | 6 | Yes | Yes | Q8_0 | 6    | 0.15 | 0.9578 | 0.9403 |
| Small | 6 | Yes | Yes | Q4_0 | 3    | 0.13 | 0.9560 | 0.9382 |

Analyzing the trends in Table 2, we can observe several key points:

  1. Model size significantly impacts performance and latency. The Large model achieves high F1 scores but at the cost of large size (1272 MB) and high latency (5.51 ms/tok).
  2. Layer dropping proves effective in reducing model size and latency while maintaining performance, up to a point. Dropping up to 7 out of 12 layers yields minimal performance loss, but removing an 8th layer results in a significant performance drop.
  3. Distillation proves highly effective. The distilled Small model with 6 layers dropped retains 99.5% of the Base model performance while being 12x smaller and 8x faster.
  4. Vocabulary truncation further reduces model size by 10MB with little impact on performance.
  5. Quantization dramatically decreases model size (from 23 MB to 3 MB for Q4_0) and improves latency, with only a minor impact on F1 scores (less than 0.2% decrease).
  6. Compared to ELECTRA Small (row 3), our model (last row) reduces the size by 17x to 3 MB, reduces latency by 3x to 0.13 ms/token, and achieves comparable F1 scores on UD datasets.

Figure 3 complements Figure 1 and shows a similar performance vs. size tradeoff on the HKCanCor test split. The notable exception is that ELECTRA Small outperforms Base, which is likely due to the smaller model overfitting to HKCanCor during fine-tuning.

Figure 3. Accuracy on HKCanCor Test vs Model Size

Conclusion

In conclusion, our DIPS model for Cantonese word segmentation demonstrates a remarkable balance between performance, size, and latency. Through a combination of techniques including layer dropping, distillation, vocabulary truncation, and quantization, we have developed a model that is 17 times smaller and 3 times faster than the ELECTRA Small model, while maintaining comparable performance. This significant reduction in size and improvement in speed makes our model highly suitable for deployment in resource-constrained environments and real-time applications.

Try our model with pip install pydips or npm install dips.js! Mobile implementations are in the works - stay tuned!

Acknowledgements

We would like to thank Charles Lam, Chaak Ming Lau, and Jackson L. Lee for their work on creating the first multi-tiered Cantonese word segmentation corpus. Without their work, this project would not have been feasible. We would also like to thank Toasty News for open-sourcing their pretrained models, scripts, and evaluation results, which provided a strong foundation for this project.

References

Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. What is the State of Neural Network Pruning?
Tzu Hsuan Chou, Chun-Yi Lin, and Hung-Yu Kao. 2023. Advancing Multi-Criteria Chinese Word Segmentation Through Criterion Classification and Denoising. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6460–6476, Toronto, Canada. Association for Computational Linguistics.
Thomas Emerson. 2005. The Second International Chinese Word Segmentation Bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 1135–1143, Cambridge, MA, USA. MIT Press.
Han He, Lei Wu, Hua Yan, Zhimin Gao, Yi Feng, and George Townsend. 2018. Effective Neural Solution for Multi-Criteria Word Segmentation. arXiv:1712.02856 [cs].
Ting Hu, Christoph Meinel, and Haojin Yang. 2022. Empirical Evaluation of Post-Training Quantization Methods for Language Tasks. arXiv:2210.16621 [cs].
Charles Lam, Chaak-ming Lau, and Jackson L. Lee. 2024. Multi-Tiered Cantonese Word Segmentation. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11993–12002, Torino, Italia. ELRA and ICCL.
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2023. On the effect of dropping layers of pre-trained transformer models. Comput. Speech Lang., 77(C).
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs].
Tak-sum Wong, Kim Gerdes, Herman Leung, and John Lee. 2017. Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank. In Simonetta Montemagni and Joakim Nivre, editors, Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), pages 266–275, Pisa, Italy. Linköping University Electronic Press.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), pages 36–39. arXiv:1910.06188 [cs].
Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu. 2020. TernaryBERT: Distillation-aware Ultra-low Bit BERT. arXiv:2009.12812 [cs, eess].
出嚟食飯. Building a Hongkongese Word Segmenter. Medium.
ggerganov/llama.cpp Discussions. GitHub.