JFinTEB: Japanese Financial Text Embedding Benchmark
Summary
JFinTEB introduces the first comprehensive benchmark for evaluating Japanese financial text embeddings, addressing a gap in domain-specific and language-specific evaluation resources. The benchmark comprises 11 tasks spanning classification, retrieval, and clustering, evaluated on 14 embedding models including Japanese-specific, multilingual, and commercial models. The datasets and evaluation framework are publicly released.
# JFinTEB: Japanese Financial Text Embedding Benchmark

Source: https://arxiv.org/html/2604.15882

###### Abstract

We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at https://github.com/retarfi/JFinTEB to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain-specific embedding research.

Keywords: Text Embeddings; Financial Domain; Benchmark Evaluation; Domain Adaptation; Text Mining; Information Retrieval; Japanese

Conference: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia

## 1. Introduction

Figure 1. Overview of JFinTEB benchmark.
JFinTEB is a comprehensive benchmark for evaluating Japanese financial text embeddings, covering diverse tasks such as classification, retrieval, and clustering.

Text embeddings provide a unified representation for textual data and underpin a wide range of text mining and natural language processing tasks. Embedding-based retrieval is a core component of modern information retrieval systems, including search, question answering, and recommendation. The development of comprehensive benchmarks has been crucial for advancing embedding quality, with established evaluations including MTEB for English (Muennighoff et al., 2023 (https://arxiv.org/html/2604.15882#bib.bib24)), MMTEB for multilingual contexts (Enevoldsen et al., 2025 (https://arxiv.org/html/2604.15882#bib.bib25)), and JMTEB for Japanese (Li et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib5)). Recently, domain-specific benchmarks such as FinMTEB have demonstrated the importance of specialized evaluation for financial applications in English and Chinese (Tang and Yang, 2025 (https://arxiv.org/html/2604.15882#bib.bib22)).

Despite Japan's position as a major financial market and the growing adoption of NLP technologies in Japanese financial institutions, no unified benchmark exists for evaluating Japanese financial text embeddings. This gap represents a significant limitation for developing and deploying embedding models in Japan's financial sector, where domain-specific language patterns, regulatory terminology, and cultural contexts require specialized evaluation.

In particular, Japanese financial texts exhibit characteristics that make general-purpose benchmarks insufficient. Financial disclosures such as quarterly reports and securities filings use highly domain-specific terminology and formulaic phrasing that rarely appear in general Japanese corpora.
Such linguistic phenomena often arise in scenarios where embeddings are directly used for retrieval, clustering, or zero-shot classification, settings that are common in financial information systems but underrepresented in existing benchmarks. These considerations underscore the necessity of a dedicated benchmark to systematically evaluate embedding models for Japanese finance.

Existing benchmarks face limitations when applied to Japanese financial contexts. JMTEB (Li et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib5)) focuses on general Japanese tasks but does not include financial applications, while FinMTEB (Tang and Yang, 2025 (https://arxiv.org/html/2604.15882#bib.bib22)) targets the financial domain but only in English and Chinese. Furthermore, financial text processing demands specialized capabilities including regulatory document analysis, market sentiment understanding, and industry classification, requirements that are not adequately addressed by current evaluation frameworks.

Several related efforts have proposed financial NLP tasks and evaluations. NTCIR U4 (Kimura et al., 2025 (https://arxiv.org/html/2604.15882#bib.bib27)) addresses Japanese financial document understanding through table retrieval and extraction, and Hirano (Hirano, 2024 (https://arxiv.org/html/2604.15882#bib.bib28)) studies the performance of large language models on Japanese financial tasks. NTCIR FinNum (Chen et al., 2018 (https://arxiv.org/html/2604.15882#bib.bib26)) investigates numerical semantics in financial texts in English. While these works provide valuable task-specific insights, they are not designed to evaluate text embeddings under a unified, multi-task benchmarking framework.

To address these limitations, we introduce JFinTEB (Japanese Financial Text Embedding Benchmark), the first comprehensive benchmark for evaluating Japanese financial text embeddings. Figure 1 (https://arxiv.org/html/2604.15882#S1.F1) illustrates the overall structure of our benchmark.
Our benchmark comprises 11 carefully designed tasks spanning classification, retrieval, and clustering across diverse financial contexts, from regulatory documents to market sentiment analysis. We establish rigorous quality assurance protocols and evaluate 14 representative embedding models, providing baseline results and practical insights for model selection in Japanese financial applications. As a result, JFinTEB complements existing benchmarks by introducing financial-domain tasks in Japanese, filling a critical gap for both domestic applications and cross-lingual evaluation. Table 1 (https://arxiv.org/html/2604.15882#S1.T1) summarizes the scope of prior embedding benchmarks compared to JFinTEB, highlighting how our resource complements prior work.

Unlike prior benchmarks, JFinTEB focuses on the Japanese financial domain, which combines language-specific challenges with application scenarios that are critical for industry but missing in existing resources. Rather than constructing artificially difficult tasks, JFinTEB focuses on realistic and well-defined information needs observed in Japanese financial applications, where embeddings are commonly used for retrieval, clustering, and classification. Consequently, some retrieval tasks exhibit high performance under current models, highlighting the maturity of embedding methods in realistic financial settings.

To our knowledge, JFinTEB is the first benchmark that systematically evaluates Japanese financial text embeddings across multiple tasks under a unified evaluation protocol. Our design enables reproducible evaluation of text embeddings under data distributions commonly encountered in practical information retrieval systems.
The contributions of this work are threefold: (1) development of the first comprehensive Japanese financial text embedding benchmark with 11 validated tasks; (2) systematic evaluation of 14 embedding models including Japanese-specialized and multilingual approaches; and (3) public release of all datasets, evaluation code, and baseline results at https://github.com/retarfi/JFinTEB, to facilitate reproducible research in Japanese financial text mining.

Table 1. Comparison with prior embedding benchmarks. Abbreviations: Cls = Classification, Ret = Retrieval, Clus = Clustering, STS = Semantic Textual Similarity, RR = Reranking, PC = Pair Classification, Summ = Summarization, BM = Bitext Mining.

## 2. JFinTEB Benchmark

### 2.1. Task Design and Dataset Construction

#### 2.1.1. Classification Tasks

We include several existing datasets: chABSA for aspect-based sentiment (Takahiro Kubo, 2018 (https://arxiv.org/html/2604.15882#bib.bib4)), three tasks derived from the Economy Watchers Survey (domain, sentiment, horizon) (Suzuki and Sakaji, 2025 (https://arxiv.org/html/2604.15882#bib.bib1)), MultiFin-ja for financial news headlines (Jørgensen et al., 2023 (https://arxiv.org/html/2604.15882#bib.bib9)), and Wikinews classification (Nishikawa et al., 2022 (https://arxiv.org/html/2604.15882#bib.bib6)).

"Horizon" performs binary classification of Economy Watchers Survey comments into current-state and future-outlook categories, utilizing the inherent survey structure that collects assessments for both present and prospective economic conditions.

In addition, we construct two new datasets: Industry 17 and Industry 33, where company descriptions from Japanese Wikipedia are aligned with official JPX industry categories at two granularities (17 and 33 sectors), enabling coarse- and fine-grained evaluation.
Company pages of Tokyo Stock Exchange Prime Market listed firms are identified by parsing the listing information template in Japanese Wikipedia articles and extracting stock codes via automated matching with the JPX database, with no manual verification. Since JPX industry classifications assign each company to exactly one sector, no ambiguity arises in label assignment. Industry labels follow the official classification published by Japan Exchange Group (JPX) (https://www.jpx.co.jp/english/markets/statistics-equities/misc/01.html).

Wikinews classification categorizes Wikinews articles into politics and economics domains, extracted from the broader categorical structure used in previous studies (Nishikawa et al., 2022 (https://arxiv.org/html/2604.15882#bib.bib6)).

#### 2.1.2. Retrieval Tasks

Our four retrieval tasks cover diverse information access needs. JaFIn (Tanabe et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib3)) evaluates retrieval of financial FAQs, and PFMT (Hirano and Imajo, 2025 (https://arxiv.org/html/2604.15882#bib.bib2)) provides a multi-turn regulatory Q&A benchmark.

In addition, we construct two retrieval datasets using automated procedures with no manual annotation. Wikinews retrieval matches news headlines with their corresponding articles, drawn from the same politics and economics categories described above. Wikipedia retrieval pairs company names, taken directly from Japanese Wikipedia article titles, with their corresponding descriptions, using the same set of Tokyo Stock Exchange Prime Market listed companies identified for Industry 17/33. These additions ensure coverage of news-driven and corporate information scenarios specific to the Japanese financial context.

#### 2.1.3. Clustering Tasks

We extend prior setups (Suzuki and Sakaji, 2025 (https://arxiv.org/html/2604.15882#bib.bib1)) by introducing "Reason", which groups comments into 13 economic reasoning categories.
Unlike earlier work that included an "other" class, we redefine two frequent categories (trends in hires, employment type characteristics) to provide more balanced clustering.

### 2.2. Dataset Statistics and Availability

Table 2 (https://arxiv.org/html/2604.15882#S2.T2) presents statistics for all tasks. Retrieval and clustering tasks use validation sets only for selecting evaluation configurations, with no model training; hence, the absence of training sets. All datasets curated in this study (Horizon, Wikinews (classification and retrieval), Industry 17, Industry 33, and Wikipedia-retrieval) are publicly available at https://github.com/retarfi/JFinTEB and contain no personally identifiable information.

Table 2. JFinTEB Task Statistics. Chars indicates the median number of characters in the validation (val.) set.

### 2.3. Evaluation Methodology

Following JMTEB (Li et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib5)), we adopt standard evaluation protocols for classification (macro-F1), retrieval (NDCG@10), and clustering (V-measure). Our implementation builds directly on the JMTEB codebase with minor modifications to incorporate financial datasets, ensuring consistency and reproducibility across benchmarks. For all tasks, validation sets are used exclusively to select evaluation configurations, while test sets are held out for final reporting. For retrieval and clustering tasks, no model training is performed; validation data are used only to select evaluation settings, ensuring fair and reproducible comparisons across models.

### 2.4. Quality Assurance and Validation

We validate task quality using two stability criteria to ensure reliable evaluation:

Model Family Consistency: Using three embedding model families with different parameter scales, namely Multilingual E5 (small/large) (Wang et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib11)), Ruri v3 (30M/310M) (Tsukagoshi and Sasano, 2024 (https://arxiv.org/html/2604.15882#bib.bib7)), and OpenAI text-embedding-3 (small/large), we identify tasks showing size-performance reversals across two or more families. Only MultiFin-ja exhibited such reversals across two families (E5 and OpenAI), likely due to its significantly smaller sample size as shown in Table 2 (https://arxiv.org/html/2604.15882#S2.T2).

Validation-Test Stability: We exclude tasks with validation-test score differences exceeding 20% in the three large models, indicating potential distribution mismatches or evaluation instabilities. One classification task (Industry 33) showed excessive validation-test divergence in the three models.

Based on these criteria, we exclude MultiFin-ja and Industry 33 from the final benchmark. The resulting JFinTEB comprises 11 stable tasks across classification, retrieval, and clustering, ensuring consistent and interpretable evaluation results across diverse embedding models.

## 3. Evaluation

We evaluate representative embedding models across different architectures and languages, primarily selected based on strong performance reported in JMTEB. Table 3 (https://arxiv.org/html/2604.15882#S3.T3) summarizes the model statistics, including parameter sizes and maximum input lengths.
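As an illustration of the two stability criteria from Section 2.4, the following is a minimal Python sketch of how such task filtering could be expressed. It is not the released JFinTEB code. The function names, task names, and all scores are hypothetical. The sketch also assumes that the 20% threshold is a relative difference and that a task is flagged only when all three large models exceed it; the text does not specify either detail.

```python
from typing import Dict, List, Tuple

# Hypothetical input: task -> family -> (small_model_score, large_model_score)
FamilyScores = Dict[str, Dict[str, Tuple[float, float]]]
# Hypothetical input: task -> [(validation_score, test_score), ...] per large model
ValTestScores = Dict[str, List[Tuple[float, float]]]


def size_reversal_tasks(scores: FamilyScores, min_families: int = 2) -> List[str]:
    """Tasks where the small model beats the large one in >= min_families families."""
    flagged = []
    for task, families in scores.items():
        reversals = sum(1 for small, large in families.values() if small > large)
        if reversals >= min_families:
            flagged.append(task)
    return flagged


def relative_difference(a: float, b: float) -> float:
    """Relative gap between two scores (assumed definition, not from the paper)."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def unstable_tasks(val_test: ValTestScores, threshold: float = 0.20) -> List[str]:
    """Tasks whose validation-test gap exceeds the threshold for every listed model."""
    flagged = []
    for task, pairs in val_test.items():
        if pairs and all(relative_difference(v, t) > threshold for v, t in pairs):
            flagged.append(task)
    return flagged


# Toy, made-up scores purely to exercise the functions.
size_scores: FamilyScores = {
    "task_a": {"E5": (0.70, 0.65), "Ruri": (0.72, 0.75), "OpenAI": (0.71, 0.69)},
    "task_b": {"E5": (0.60, 0.66), "Ruri": (0.62, 0.70), "OpenAI": (0.68, 0.64)},
}
val_test_scores: ValTestScores = {
    "task_a": [(0.70, 0.68), (0.75, 0.74), (0.69, 0.70)],
    "task_c": [(0.80, 0.55), (0.82, 0.60), (0.78, 0.50)],
}

excluded = sorted(
    set(size_reversal_tasks(size_scores)) | set(unstable_tasks(val_test_scores))
)
print(excluded)
```

With these toy inputs, `task_a` is flagged by the first check (reversals in two families) and `task_c` by the second, mirroring how the text excludes one task per criterion.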

Japanese-Specialized Models: We evaluate leading Japanese embedding models: (1) Ruri v3 series (Ruri) (Tsukagoshi and Sasano, 2024 (https://arxiv.org/html/2604.15882#bib.bib7)), trained with contrastive learning and based on Japanese ModernBERT (Tsukagoshi et al., 2025 (https://arxiv.org/html/2604.15882#bib.bib20)); (2) Sarashina (https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b), which is derived from a 1.2B Japanese LLM with multi-stage training and achieves state-of-the-art JMTEB performance; and (3) GLuCoSE (https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2), a LUKE-based (Yamada et al., 2020 (https://arxiv.org/html/2604.15882#bib.bib21)) model optimized for Japanese semantic tasks.

Multilingual Models: We include three high-performing multilingual embedding families: (1) jina-embeddings-v3 (Sturua et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib15)) (Jina), a multi-task model based on XLM-RoBERTa (Conneau et al., 2020 (https://arxiv.org/html/2604.15882#bib.bib17)) with 8192-token capacity and LoRA (Hu et al., 2022 (https://arxiv.org/html/2604.15882#bib.bib18)) adapters; (2) Multilingual E5 series (Wang et al., 2024 (https://arxiv.org/html/2604.15882#bib.bib11)) (E5); and (3) OpenAI text-embedding-3 (OpenAI).

Domain Adaptation Baselines: We include Japanese BERT (BERT) (https://huggingface.co/tohoku-nlp/bert-base-japanese) and Japanese financial BERT (FinBERT) (Suzuki et al., 2023 (https://arxiv.org/