Optimizing Multi-Stage Language Models for Effective Japanese Legal Document Retrieval
Received: 13 May 2025 | Revised: 24 June 2025, 12 July 2025, and 17 July 2025 | Accepted: 1 August 2025 | Online: 28 August 2025
Corresponding author: Trung Quang Hoang
Abstract
Efficient text retrieval is critical for applications such as legal document analysis, particularly within specialized domains like the Japanese legal system. Existing methods often underperform in these scenarios: conventional BM25-based systems fail to capture the nuanced legal expressions and formal sentence structures common in Japanese case law, resulting in low recall for relevant precedents. Consequently, tailored solutions are required. This study proposes a novel two-phase retrieval pipeline that replaces the sparse components (e.g., BM25+, TF–IDF) found in hybrid architectures such as CoCondenser, instead operating end-to-end with dense language models alone. The pipeline applies progressive fine-tuning—beginning with masked language model pretraining and followed by contrastive learning with hard negative mining—to iteratively improve accuracy on legal-domain queries. To ensure transparency and clear comparison, two variants are assessed: an LM-only version (using both off-the-shelf and fine-tuned models) and a hybrid version that reintegrates BM25+, allowing the impact of sparse components on retrieval performance to be quantified. On a Japanese legal dataset, the proposed approach achieved state-of-the-art performance, yielding a 5.74% improvement in Recall@10 and an 11% gain in nDCG@10 over the strongest baseline, while remaining competitive on the MS-MARCO benchmark. To further enhance robustness and adaptability, an ensemble model integrated multiple retrieval strategies, yielding superior outcomes across diverse tasks. This work sets a new standard for text retrieval in both domain-specific and general contexts, offering a comprehensive solution for handling complex queries in legal and multilingual environments.
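The LM-only and hybrid variants described above can be sketched as a single scoring function: rank candidates by dense cosine similarity, and, in the hybrid case, fuse in a normalized sparse (BM25+) score by weighted sum. This is a minimal illustration of that comparison, not the paper's implementation; the function names, the min-max normalization, and the fusion weight `alpha` are assumptions for the sketch.

```python
import numpy as np

def dense_scores(query_emb, passage_embs):
    # Cosine similarity between one query embedding and all candidate passages.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    return p @ q

def minmax(x):
    # Min-max normalize so dense and sparse scores share a [0, 1] scale.
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def hybrid_rank(query_emb, passage_embs, bm25_scores=None, alpha=0.7, k=10):
    # LM-only variant: rank purely by dense similarity.
    fused = minmax(dense_scores(query_emb, np.asarray(passage_embs, float)))
    # Hybrid variant: reintegrate a sparse BM25+ signal via weighted fusion.
    if bm25_scores is not None:
        fused = alpha * fused + (1 - alpha) * minmax(np.asarray(bm25_scores, float))
    # Indices of the top-k candidates, best first.
    return np.argsort(-fused)[:k]
```

Setting `bm25_scores=None` reproduces the LM-only configuration; supplying sparse scores and sweeping `alpha` is one way to quantify how much the sparse component contributes, mirroring the ablation described in the abstract.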
Keywords:
two-phase, text retrieval, ensemble
License
Copyright (c) 2025 Trung Quang Hoang, Hoang Le Trung, Phuc Nguyen Van Hoang, Hieu Quang Huu

This work is licensed under a Creative Commons Attribution 4.0 International License.