Optimizing Multi-Stage Language Models for Effective Japanese Legal Document Retrieval
Received: 13 May 2025 | Revised: 24 June 2025, 12 July 2025, and 17 July 2025 | Accepted: 1 August 2025 | Online: 28 August 2025
Corresponding author: Trung Quang Hoang
Abstract
Efficient text retrieval is critical for applications such as legal document analysis, particularly within specialized domains like the Japanese legal system. Existing methods often underperform in these scenarios: conventional BM25-based systems fail to capture the nuanced legal expressions and formal sentence structures common in Japanese case law, resulting in low recall for relevant precedents. Consequently, tailored solutions are required. This study proposes a novel two-phase retrieval pipeline that replaces the sparse components (e.g., BM25+, TF–IDF) found in hybrid architectures such as CoCondenser, instead operating end-to-end with dense language models alone. The pipeline applies progressive fine-tuning—beginning with masked language model pretraining and followed by contrastive learning with hard negative mining—to iteratively improve accuracy on legal-domain queries. To ensure transparency and clear comparison, two variants are assessed: an LM-only version (using both off-the-shelf and fine-tuned models) and a hybrid version that reintegrates BM25+, allowing the impact of sparse components on retrieval performance to be quantified. On a Japanese legal dataset, the proposed approach achieved state-of-the-art performance, yielding a 5.74% improvement in Recall@10 and an 11% gain in nDCG@10 over the strongest baseline, while remaining competitive on the MS-MARCO benchmark. To further enhance robustness and adaptability, an ensemble model integrated multiple retrieval strategies, yielding superior outcomes across diverse tasks. This work sets a new standard for text retrieval in both domain-specific and general contexts, offering a comprehensive solution for handling complex queries in legal and multilingual environments.
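The LM-only and hybrid variants described above can be sketched as a single scoring function: rank candidates by dense cosine similarity, and, in the hybrid case, fuse in a normalized sparse (BM25+) score by weighted sum. This is a minimal illustration of that comparison, not the paper's implementation; the function names, the min-max normalization, and the fusion weight `alpha` are assumptions for the sketch.

```python
import numpy as np

def dense_scores(query_emb, passage_embs):
    # Cosine similarity between one query embedding and all candidate passages.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    return p @ q

def minmax(x):
    # Min-max normalize so dense and sparse scores share a [0, 1] scale.
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def hybrid_rank(query_emb, passage_embs, bm25_scores=None, alpha=0.7, k=10):
    # LM-only variant: rank purely by dense similarity.
    fused = minmax(dense_scores(query_emb, np.asarray(passage_embs, float)))
    # Hybrid variant: reintegrate a sparse BM25+ signal via weighted fusion.
    if bm25_scores is not None:
        fused = alpha * fused + (1 - alpha) * minmax(np.asarray(bm25_scores, float))
    # Indices of the top-k candidates, best first.
    return np.argsort(-fused)[:k]
```

Setting `bm25_scores=None` reproduces the LM-only configuration; supplying sparse scores and sweeping `alpha` is one way to quantify how much the sparse component contributes, mirroring the ablation described in the abstract.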
Keywords:
two-phase, text retrieval, ensemble
License
Copyright (c) 2025 Trung Quang Hoang, Hoang Le Trung, Phuc Nguyen Van Hoang, Hieu Quang Huu

This work is licensed under a Creative Commons Attribution 4.0 International License.