Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management
Summary
This paper presents an overview of the second edition of the TalentCLEF challenge at CLEF 2026, which includes tasks on job-person matching in English and Spanish and job-skill matching in English, attracting over 400 submissions.
# Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management

Source: [https://arxiv.org/html/2606.31692](https://arxiv.org/html/2606.31692)

Hermenegildo Fabregat, Laura García\-Sardiña, Paula Estrella, Warre Veys, Casimiro Pío Carrino, Matthias De Lange, Daniel Deniz Cerpa, Álvaro Rodrigo, Jens\-Joris Decorte, Rabih Zbib

1. Avature Machine Learning, Spain \(machinelearning@avature.net\)
2. TechWolf, Belgium
3. NLP & IR Group at UNED, Madrid, Spain

###### Abstract

This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum \(CLEF\) 2026\. TalentCLEF is an initiative aimed at advancing Natural Language Processing research in Human Capital Management\. The second edition of the challenge consisted of two tasks: Task A, contextualized job\-person matching, focuses on identifying and ranking the most suitable candidates represented by their resumes for a given job vacancy in English and Spanish\. Task B, job\-skill matching with skill type classification, addresses retrieving the most relevant skills for a given job title in English and distinguishing between core and contextual skills\. TalentCLEF attracted 113 registered teams and received more than 400 submissions in the two tasks, reflecting the growing interest of the research community in shared evaluation benchmarks for Human Capital Management\. This paper describes the motivation and organization of the challenge, summarizes the datasets and evaluation settings, and reports the main results obtained by the participating teams\.

## 1 Introduction

The transformation of the labor market is changing the way organizations describe jobs, identify talent, and support career development\[[39](https://arxiv.org/html/2606.31692#bib.bib3)\]\. 
In this context, Human Capital Management \(HCM\) increasingly requires Natural Language Processing \(NLP\) systems capable of processing and connecting information about jobs, people, and skills\. Such systems are relevant not only for recruitment and talent acquisition, but also for broader workforce development scenarios, including career guidance, skill gap analysis, internal mobility, upskilling, and reskilling\. Language technologies are particularly well suited to address these challenges since much of the relevant information in this domain is expressed in text\. Job advertisements, job titles, professional profiles, curricula, learning resources, and skill or occupation taxonomies contain valuable information about the relationships between occupations, skills, workers, and learning opportunities\. By extracting, normalizing, matching, and linking this information, NLP methods can support more structured representations of the labor market and enable downstream applications for job and skill intelligence\.

In recent years, the application of NLP to Human Capital Management has received growing research attention, as reflected in dedicated venues such as NLP4HR\[[28](https://arxiv.org/html/2606.31692#bib.bib6)\] and RecSys in HR\[[4](https://arxiv.org/html/2606.31692#bib.bib7)\]\. 
These initiatives have helped consolidate the area by bringing together research on traditional NLP tasks applied in the Human Resources \(HR\) area, such as skill extraction from job postings\[[70](https://arxiv.org/html/2606.31692#bib.bib32),[69](https://arxiv.org/html/2606.31692#bib.bib33),[45](https://arxiv.org/html/2606.31692#bib.bib26),[61](https://arxiv.org/html/2606.31692#bib.bib29)\], skill normalization to taxonomies\[[10](https://arxiv.org/html/2606.31692#bib.bib23),[52](https://arxiv.org/html/2606.31692#bib.bib12)\], matching\[[14](https://arxiv.org/html/2606.31692#bib.bib25),[36](https://arxiv.org/html/2606.31692#bib.bib30),[15](https://arxiv.org/html/2606.31692#bib.bib11)\], job recommendation\[[25](https://arxiv.org/html/2606.31692#bib.bib35),[13](https://arxiv.org/html/2606.31692#bib.bib22),[1](https://arxiv.org/html/2606.31692#bib.bib27),[18](https://arxiv.org/html/2606.31692#bib.bib13)\], but also novel ones such as career path modeling\[[12](https://arxiv.org/html/2606.31692#bib.bib24),[54](https://arxiv.org/html/2606.31692#bib.bib39)\], LLM comprehension\[[5](https://arxiv.org/html/2606.31692#bib.bib15),[64](https://arxiv.org/html/2606.31692#bib.bib40)\], or the analysis of fairness and bias in recruitment\-related systems\[[2](https://arxiv.org/html/2606.31692#bib.bib28),[19](https://arxiv.org/html/2606.31692#bib.bib16),[48](https://arxiv.org/html/2606.31692#bib.bib37),[56](https://arxiv.org/html/2606.31692#bib.bib38)\]\. Despite recent advances in this area, research remains fragmented\. Existing studies often rely on different datasets, languages, task definitions, annotation schemes, and evaluation protocols, making it difficult to compare systems or assess progress consistently\. This fragmentation is particularly problematic given that such systems can influence real\-world decision\-making scenarios, including recruitment\. 
Therefore, the development of public benchmarks remains an important need in the field, as they can help structure progress in similar ways to previous initiatives in other domains, such as biomedical NLP\[[44](https://arxiv.org/html/2606.31692#bib.bib78)\]\. TalentCLEF addresses this fragmentation by providing a shared evaluation framework for NLP systems in Human Capital Management\[[23](https://arxiv.org/html/2606.31692#bib.bib17),[21](https://arxiv.org/html/2606.31692#bib.bib18)\]\. The initiative is organized around competitive evaluation campaigns grounded in realistic job and skill intelligence scenarios, with the goal of promoting the development of robust, multilingual, and reusable language technologies\. At the same time, TalentCLEF provides the research community with high\-quality datasets, common evaluation protocols, and public benchmarks that support reproducible research and future system comparison\. In this paper, we present an overview of the TalentCLEF 2026 Challenge\. We describe the motivation of the challenge, introduce the proposed tasks, and summarize the main results obtained by the participating systems\. The challenge attracted 113 registered teams and received more than 400 submissions across the two tasks\. We also analyze the main methodological trends observed in the submitted systems, including hybrid retrieval, reranking approaches, generative AI components, and the use of structured knowledge sources such as skill and occupation graphs\. Detailed descriptions of each task, including dataset construction, annotation procedures, and task\-specific evaluation settings, are provided in the corresponding task overview articles\[[17](https://arxiv.org/html/2606.31692#bib.bib20),[62](https://arxiv.org/html/2606.31692#bib.bib21)\]\. 
## 2 Overview of the Tasks

The second edition of the TalentCLEF Challenge\[[22](https://arxiv.org/html/2606.31692#bib.bib19)\] aims to promote the development and evaluation of systems for two highly relevant tasks in Human Capital Management: Task A, candidate search for a given job vacancy, and Task B, identification of professional skills relevant to specific job positions\.

### 2\.1 Task A \- Contextualized Job\-Person Matching

Candidate matching is one of the main challenges in Human Capital Management\. When performed manually, this process typically relies on the individual reading of the resumes, the interpretation of the job descriptions, and the expertise and judgment of the recruiters\. Although this approach allows human judgment and contextual knowledge to be incorporated into the decision\-making process, it is increasingly difficult to scale in today’s labor market, where a single job vacancy may receive hundreds of candidate profiles\[[32](https://arxiv.org/html/2606.31692#bib.bib41)\]\. As a result, manually reviewing such large volumes of applications is often not feasible in practice\.

In response to these limitations, automatic matching systems based on NLP and information extraction techniques have been developed in recent years\. These systems usually focus on identifying and normalizing relevant entities in both job vacancies and candidate profiles, such as job titles, skills, competencies, education, and languages\[[68](https://arxiv.org/html/2606.31692#bib.bib8),[20](https://arxiv.org/html/2606.31692#bib.bib10)\]\. Once this information has been extracted, the systems compare the entities present in both types of document to estimate the degree of suitability between a candidate and a job vacancy\. This represents a valid and widely used approach\. However, the recent emergence of Large Language Models \(LLMs\) opens new possibilities for addressing the matching problem from a richer and more contextual perspective\. 
In particular, LLM\-based approaches can, without requiring specific fine\-tuning, help incorporate additional dimensions into the matching process, such as seniority level, evidence of expertise in key skills, demonstrated experience in real\-world contexts, and the inference of implicit or related skills, among others\[[60](https://arxiv.org/html/2606.31692#bib.bib42),[24](https://arxiv.org/html/2606.31692#bib.bib43),[41](https://arxiv.org/html/2606.31692#bib.bib44)\]\. This creates new opportunities for the development of more flexible and context\-aware candidate matching systems\.

In the previous edition of TalentCLEF Task A, the problem focused exclusively on job title matching\. However, as discussed above, candidate matching involves many other types of information, which can be extracted from documents containing much richer descriptions of both candidates and job vacancies\. For this reason, Task A in this year’s edition has been framed as a broader problem of contextualized job\-person matching, where the aim is to develop systems capable of identifying and ranking the most suitable candidates for a given job vacancy\. To support this task, we provide a manually\-annotated synthetic corpus composed of job descriptions and candidate profiles\. Participants are free to process this corpus using different methodological approaches, including information extraction, prompt engineering, information retrieval, or other NLP\-based techniques, to generate a ranked list of candidates for each job vacancy\.

Figure 1: Overview of Task A: Contextualized Job–Person Matching

#### 2\.1\.1 Data

In this task, we provide a multilingual dataset for contextualized person\-job matching, covering both English and Spanish\. The task corpus is divided into a development set and a test set\. Training data were not provided for the task; however, participating teams were allowed to use any external resources or additional information they considered relevant\. 
Both partitions were synthetically generated, including job records and resumes that describe vacancy requirements and candidate profiles\. Rather than relying on uncontrolled generation, the process was guided by statistical evidence on job\-skill co\-occurrence patterns extracted from real\-world resumes and job descriptions, obtained from an internal database\. \(Footnote 1: Although real data were used for the data generation, the statistical evidence was manually reviewed to avoid data leakage\.\) The selection of profiles and vacancies was designed to cover a wide range of industries, occupations, professional backgrounds, genders, and ethnicities\. The resulting candidate\-job pairs were subsequently reviewed and manually annotated by expert annotators for the matching task\. Table [1](https://arxiv.org/html/2606.31692#S2.T1) summarizes the main statistics of the corpus, including the number of job descriptions \(queries\) and resumes \(corpus\) per language\. Data for this task were made available to participating teams through Zenodo \(Task A corpus: https://doi.org/10.5281/zenodo.17625261\)\. Further details on the dataset generation and annotation process are provided in the extended overview of the task\.

Table 1: Statistics of Task A’s development and test sets by language\.

#### 2\.1\.2 Evaluation

The task was evaluated through a competition hosted on Codabench \(Task A Codabench: https://www.codabench.org/competitions/14226/\), which provided the participating teams with a common environment to submit their predictions and access the official leaderboards\. In addition, the use of this platform allows the task to remain available as an open benchmark for continuous evaluation after the end of the challenge, supporting reproducibility and the comparison of future systems\.

In this edition, we consider three evaluation settings\. The first is monolingual, in which both the job vacancy and the candidate resumes are written in the same language\. 
The second is cross\-lingual, with the job vacancy written in English and the resumes in Spanish\. This setting is relevant in multilingual contexts, where a company may publish a vacancy in one language while candidates describe their work experience in another\. The third setting focuses on bias\. Since job matching systems can have a direct impact on people’s employment opportunities, it is essential to analyze not only their overall performance, but also their behavior across demographic groups, such as gender or ethnicity\. In this case, we assess whether the systems produce consistent and fair rankings regardless of the candidate’s gender\.

For the monolingual and cross\-lingual scenarios, system performance is measured using Mean Average Precision \(MAP\) over the ranked list of candidates\. For the bias scenario, we use Rank\-Biased Overlap \(RBO\) to evaluate gender bias\.

### 2\.2 Task B \- Job\-Skill Matching with Skill Type Classification

Skills have become a central component of HCM\. In recent years, the emergence of artificial intelligence and other technological transformations has accelerated changes in the labor market: new job roles are appearing at an unprecedented pace, existing occupations are being rapidly redefined, and the skill requirements associated with many positions are continuously evolving\. As a result, organizations increasingly need support systems not only to define new professional roles and the technical skills they require, but also to update the knowledge and capabilities of their workforce so that employees can progressively adapt to technological change\.

This shift has reinforced the importance of skill\-based approaches in recruitment, workforce planning, and talent development\. In recruitment, such systems can help identify candidates whose capabilities match the requirements of a role, even when their previous job titles or career paths are not directly related to the vacancy\. 
In workforce management, they can support the identification of skill gaps and the recommendation of learning pathways that help workers adapt to new occupational demands\.

Last year, Task B focused on retrieving the skills most relevant to a given job title\. This year, the task expands that setting by requiring systems not only to identify relevant skills, but also to consider whether each skill is core or contextual for the target job\. Core skills are those required to perform a job regardless of the work context or employer and are therefore essential to the position\. Contextual skills, in contrast, depend on factors such as industry, organization, or a specific work environment and can be considered complementary or optional depending on the context\.

The objective of Task B this year is to develop systems capable of understanding both the relevance and the role of professional skills in relation to job titles\. Given a database of professional skills and a specific job, participating systems are required to identify the most relevant skills and classify them according to their importance for the position\.

Figure 2: Overview of Task B: Job\-Skill Matching with Skill Type Classification

#### 2\.2\.1 Data

In Task B, we provide a monolingual corpus in English divided into three splits: a training set, a development set, and a test set\. The training data consist of a list of the most representative skills for each ESCO occupation\. A filtering process was applied to limit the number of skills associated with each job title and avoid outlier cases with an unusually large number of skills\. These data also specify whether each skill is essential or optional for the corresponding occupation\. For the development and test sets, we refined and expanded the dataset used in last year’s edition of the task\. 
The queries corresponding to job titles are the same as in the previous edition, but the corpus elements were expanded and revised to provide a richer and more reliable evaluation setting\. This process combined semantic techniques, human validation, and LLM\-as\-a\-judge procedures to improve coverage and ensure the quality of relevance annotations\. Table [2](https://arxiv.org/html/2606.31692#S2.T2) summarizes the main statistics of the corpus, including the number of queries, the corpus elements, and the average number of relevant items per query\. Data were made available through Zenodo \(Task B corpus: https://doi.org/10.5281/zenodo.17625261\)\. Further details on the generation and validation of the evaluation dataset are provided in the extended overview of the task\.

Table 2: Statistics of Task B’s development and test sets\.

#### 2\.2\.2 Evaluation

The evaluation of Task B was conducted using a Codabench competition \(Codabench Task B: https://www.codabench.org/competitions/14489/\)\. The main evaluation metric was Normalized Discounted Cumulative Gain \(NDCG\), which assesses ranking quality by considering both the relevance of the retrieved items and their position in the ranked list\. Although the task includes a skill\-type classification component, this distinction is incorporated into the ranking evaluation through two relevance settings\.

In the binary relevance scenario, all relevant skills are treated equally: both core and contextual skills receive the same relevance value\. This setting evaluates whether systems can retrieve skills that are relevant to a given job title, regardless of their type\. In the graded relevance scenario, the evaluation also accounts for the type of each relevant skill\. Core skills are assigned a higher relevance value of 2, while contextual skills are assigned a relevance value of 1\. 
This setting rewards systems that rank core skills higher, reflecting their greater importance in performing a specific occupation, while still giving credit for retrieving contextual skills that may be relevant depending on the work environment\.

## 3 Participants

### 3\.1 Task A

Task A attracted 91 registered participants this year\. During the evaluation phase, 21 teams submitted at least one run, producing a total of 384 submissions across the development and test sets, with 12 teams sending working notes to be included in the official benchmark\.

Table [3](https://arxiv.org/html/2606.31692#S3.T3) summarizes the main systems and approaches used in this task by the participating teams\. *Representation type* describes the type of text representation used in their submissions, while *Methodology* captures the main methodological strategies adopted to address the task, such as retrieval or reranking\. *System details* reports additional information about the submitted systems, including learning objectives, fine\-tuning strategies, rank fusion methods, graph\-based components, or other relevant architectural choices\. *LLM\-related* identifies the role of LLMs in the submitted systems\. *Bias mitigation* indicates whether specific mechanisms were introduced to address gender bias, and *External data* reports the use of additional resources beyond the data provided by the organizers\.

Table 3: Overview of participating team approaches for Task A\.

The solutions proposed this year for Task A are predominantly based on multi\-stage information retrieval architectures\. 
Participants consistently frame the problem as a retrieval task: they first generate an initial set of candidates using diverse textual representation methods and, in many cases, apply a subsequent reranking stage to refine the final ranking\.

In the retrieval stage, models that generate dense semantic representations constitute the core component of the systems, but they are often complemented with lexical methods such as BM25, which appears in several submissions as an exact\-matching mechanism\. This might be particularly useful in the HCM domain, where normalized skill names and labor\-market expressions can play an important role in retrieving a good set of candidates\. In addition, models from the JobBERT family\[[13](https://arxiv.org/html/2606.31692#bib.bib22)\] are used by teams such as qwerity, classum, and HR\_NLP, to take advantage of models adapted to the labor\-market domain rather than more general\-purpose embedding models\.

A particularly frequent pattern is the incorporation of reranking stages over the retrieved candidates\. Nine of the 12 teams explicitly include some sort of reranking mechanism\. qwerity, for example, explores different strategies by combining lexical models such as SPLADE\[[35](https://arxiv.org/html/2606.31692#bib.bib62)\] with late\-interaction reranking based on ColBERT architectures\[[31](https://arxiv.org/html/2606.31692#bib.bib63)\]\. Other teams, such as VerbaNex, ui\-nlp, and Skillberg\.app, use pretrained cross\-encoders to improve the precision of the final ranking\.

Because participants often combine different representations to produce the final ranking, rank fusion is a central component of many systems\. 
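
To make the idea of rank fusion concrete, the following minimal sketch implements reciprocal rank fusion (RRF), one widely used fusion method. It is an illustration only and is not taken from any participant's system: the function name `reciprocal_rank_fusion`, the smoothing constant `k=60` (a commonly used default), and the toy resume identifiers are all assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of candidate IDs into a single ranking.

    Every candidate receives 1 / (k + rank) from each list in which it
    appears (ranks start at 1); candidates are sorted by their total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            scores[candidate] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical rankings for one vacancy, e.g. from a lexical retriever
# (BM25-style) and from a dense retriever.
lexical_ranking = ["cv_3", "cv_1", "cv_2"]
dense_ranking = ["cv_1", "cv_2", "cv_3"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)  # ['cv_1', 'cv_3', 'cv_2']
```

In this toy case `cv_1` is ranked first because it sits near the top of both lists, whereas `cv_3` is first in one list but last in the other. RRF uses only rank positions, so it can combine retrievers whose raw scores are not on comparable scales.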
Techniques such as reciprocal rank fusion, late fusion, score fusion, geometric\-mean fusion, and other rank fusion variants are used to aggregate evidence from different rankings\. This enables systems to combine complementary strengths: lexical models capture exact terminology, dense encoders capture semantic similarity, cross\-encoders improve relevance estimation, and LLM\-based rerankers, which are sometimes applied before and sometimes after rank fusion, provide more fine\-grained relevance estimates\. The widespread use of these methods highlights their importance for the task and suggests that integrating multiple sources of evidence often yields better results than relying on a single model\.

One of the most relevant differences with respect to the first edition of the task is the more extensive use of LLMs\. While the previous edition provided more limited contextual information, this year several teams exploited generative models and prompting techniques to evaluate or reorder candidates\. CYUT, DevoMatcher, classum, and bipboopbipboop incorporated LLM\-based reranking strategies\. In particular, some systems used listwise reranking, where the model receives several candidates simultaneously and produces a comparative ordering\. This approach contrasts with the pointwise reranking strategy used by bipboopbipboop, where each candidate is evaluated independently rather than in comparison with the other candidates\. This team also proposed an additional variant based on tournament\-style listwise reranking, in which candidates are compared through successive rounds or pairwise competitions\[[8](https://arxiv.org/html/2606.31692#bib.bib65)\]\.

Beyond reranking, LLMs are also used for other tasks within the pipeline, mainly as a data augmentation technique\. VerbaNex uses them for query expansion, Skillberg\.app applies data augmentation, and CYUT incorporates Hypothetical Document Embeddings \(HyDE\) for query expansion\[[38](https://arxiv.org/html/2606.31692#bib.bib66)\]\. 
These techniques aim to enrich the original input with additional information to improve the system’s ability to retrieve relevant candidates\. In addition, bipboopbipboop uses an LLM\-as\-a\-judge approach to assess candidate adequacy during the retrieval stage, making it the only team to incorporate LLM knowledge directly into the retrieval task\.

Two distinctive cases are QTPride and classum, which use LLM\-based entity extraction as a step prior to retrieval\. In addition, QTPride combines the extracted entities with the information from the ESCO skill graph, which is fine\-tuned using a Relational Graph Convolutional Network to learn structure\-aware embeddings\[[55](https://arxiv.org/html/2606.31692#bib.bib67)\]\. This allows relational information between entities to be incorporated into the retrieval process\. Other teams also exploit graph\-based information: Skillberg\.app uses a proprietary knowledge graph of job\-skill relations, while DevoMatcher incorporates a calibration signal based on the ESCO graph to adjust the final ranking\[[37](https://arxiv.org/html/2606.31692#bib.bib36)\]\.

From a training perspective, not all teams perform task\-specific fine\-tuning of embedding models for the retrieval stage\. However, when fine\-tuning is applied, the strategies are varied\. bipboopbipboop uses LoRA techniques and cosine similarity loss to fine\-tune their models\[[29](https://arxiv.org/html/2606.31692#bib.bib68)\]\. Skillberg\.app proposes a curriculum learning methodology, using GISTEmbedLoss as the loss function and MatryoshkaLoss to reduce the effective size of the vectors\[[66](https://arxiv.org/html/2606.31692#bib.bib69),[57](https://arxiv.org/html/2606.31692#bib.bib70),[34](https://arxiv.org/html/2606.31692#bib.bib71)\]\. HR\_NLP, in turn, uses a teacher\-student knowledge distillation strategy to adapt their retrieval model\. 
Finally, UBCS employs encoder\-decoder models to perform query expansion over development and test sets, reinforcing the idea that text generation can be used not only for ranking but also to enrich query representations\.

### 3\.2 Task B

Task B attracted 101 registered participants this year\. During the evaluation phase, 15 teams submitted at least one run, producing a total of 43 submissions across the development and test sets\. Table [4](https://arxiv.org/html/2606.31692#S3.T4) summarizes the main systems and approaches used by the participating teams in this task\.

Table 4: Overview of participating team approaches for Task B\.

As in Task A, the solutions proposed for Task B are based on information retrieval architectures that combine hybrid strategies, but with a stronger emphasis on representation learning, likely due to the large training and development sets provided\. In several systems, the initial retrieval step is followed by reranking, rank fusion, or LLM\-based refinement\.

In the initial retrieval stage, encoder\-based models are the dominant model choice\. Several teams rely on general\-purpose pretrained embedding models, including all\-mpnet\-base\-v2, models from the e5 family\[[65](https://arxiv.org/html/2606.31692#bib.bib72)\], BGE models\[[7](https://arxiv.org/html/2606.31692#bib.bib73)\], and Qwen\-based embedding models\[[71](https://arxiv.org/html/2606.31692#bib.bib74)\]\. Some teams, such as NightSun and bipboopbipboop, also use domain\-adapted models from the JobBERT family\. Although lexical retrieval appears less frequently than in Task A, BM25 is still used by MARSAD and Olive\.

A recurring pattern in Task B is the adaptation of dense retrievers to the provided supervision data\. 
Several teams fine\-tune their models with contrastive or metric\-learning objectives, with the aim of aligning the embedding space with the relevance patterns captured in the training and development sets\. baorphuc, Olive, ui\-nlp, and bipboopbipboop use InfoNCE losses\[[46](https://arxiv.org/html/2606.31692#bib.bib75)\], while Skillberg\.app applies contrastive learning with Cached InfoNCE\. Similarly, classum and NightSun use GISTEmbedLoss\[[57](https://arxiv.org/html/2606.31692#bib.bib70)\]\. Other fine\-tuning strategies include curriculum learning, used by Olive; domain adaptation and external data, used by hr\_gradient; and LoRA\-based adaptation of large embedding models, used by bipboopbipboop\.

LLM\-related methods are used in several submissions, mainly for reranking, data augmentation, and query expansion\. NightSun uses LLM\-based pointwise reranking and a tournament pairwise reranker, while bipboopbipboop combines pointwise reranking with a listwise select\-then\-rank strategy\. In contrast, classum and hr\_gradient report mainly the use of LLMs for data augmentation\. Generative models are also used to enrich the input before retrieval: NightSun uses Hypothetical Document Embeddings for query expansion, and bipboopbipboop also includes query expansion as part of their pipeline\. These techniques seem especially useful when the input is short or lexically different from the target items, as is the case with the data provided\. 
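
The contrastive objectives mentioned in this section can be illustrated with a small, dependency-free sketch of the InfoNCE loss with in-batch negatives. This is not any team's implementation: the helper names, the use of cosine similarity, the temperature value of 0.05, and the two-dimensional toy vectors are assumptions chosen for readability. In practice such a loss would be computed on encoder outputs inside a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(job_vecs, skill_vecs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    job_vecs[i] and skill_vecs[i] form a positive pair; every other skill
    vector in the batch acts as a negative for job i.
    """
    losses = []
    for i, job in enumerate(job_vecs):
        logits = [cosine(job, skill) / temperature for skill in skill_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # -log softmax of the positive pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two job titles and two skills.
jobs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(jobs, [[1.0, 0.0], [0.0, 1.0]])  # positives match
swapped = info_nce_loss(jobs, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
print(aligned < swapped)  # True
```

The loss is close to zero when each job title is most similar to its own positive skill and grows when a negative is more similar than the positive, which is the signal that pushes the embedding space toward the relevance patterns in the supervision data.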
## 4 Results

### 4\.1 Task A Main Results

#### 4\.1\.1 Overall Task A Performance

The main leaderboard, shown in Table [5](https://arxiv.org/html/2606.31692#S4.T5), reports the Mean Average Precision across the monolingual English and Spanish scenarios\. classum achieved the best overall performance, with an average MAP of 0\.7140, obtaining the highest results in both English–English \(0\.7122\) and Spanish–Spanish \(0\.7157\)\. QTPride ranked second overall, with an average MAP of 0\.6632, also achieving the second\-best results in both languages\. bipboopbipboop completed the top three with an average MAP of 0\.6532, showing balanced performance in English and Spanish\. It is also notable that Skillberg\.app obtained a strong English score of 0\.6608, close to the second\-best result in that language, while CYUT performed comparatively better in Spanish \(0\.6213\) than in English \(0\.5974\)\.

Table 5: Overview of team results for Task A\.

The two top\-performing systems in Task A shared a common strategy: transforming job descriptions and resumes into more comparable semantic representations before ranking\. classum relied on the semantic enrichment of both job descriptions and resumes, including extracted skill and task profiles, which were compared with structured representations of responsibilities, seniority, domain, and role compatibility inferred from the development set\. Then, they combined multiple dense semantic representations with LLM\-based listwise scoring, rubric\-based scoring, score fusion, and reranking\. QTPride, the second\-ranked team, followed a related multi\-view strategy: they first parsed job descriptions and resumes into structured JSON representations using an LLM, and then combined multiple retrieval views based on full text, work experience, and the ESCO knowledge graph through reciprocal rank fusion\. A similar trend can also be observed in Skillberg\.app, which incorporated a proprietary job–skill knowledge graph as part of their training strategy\. 
The third\-ranked team, bipboopbipboop, proposed a particularly strong multi\-stage reranking pipeline: candidate resumes were first scored by complementary matchers, including neural rerankers, bi\-encoders, and an LLM\-as\-a\-judge system, then combined through geometric\-mean rank fusion, and finally refined with an LLM\-based listwise tournament over the top candidates\. The top\-performing systems suggest that job–person matching benefits from going beyond plain\-text similarity: structured document understanding, extracted entities, graph\-derived relations, fusion mechanisms, and selective reranking help capture different dimensions of candidate–job alignment\.

#### 4\.1\.2 Cross\-Lingual Performance

To evaluate performance in the cross\-lingual setting, the metric considered was MAP for the English–Spanish language pair, where English queries were matched against Spanish corpus elements\. The results are presented in Table [6](https://arxiv.org/html/2606.31692#S4.T6)\. classum also achieved the best cross\-lingual performance, reaching a MAP of 0\.7044\. bipboopbipboop ranked second, with a MAP of 0\.6438, followed closely by Skillberg\.app with a MAP of 0\.6373\. QTPride also obtained a competitive score of 0\.6355, indicating that the systems ranked second to fourth performed similarly in this setting\.

The strongest cross\-lingual systems followed patterns similar to those observed in the monolingual setting\. The best\-performing system, classum, combined multiple cross\-lingual dense representations with enriched job and resume views\. bipboopbipboop, the second\-ranked team in this benchmark, adapted their pipeline to the cross\-lingual setting by using only two of their four matchers: a multilingual BGE\-M3 bi\-encoder and a neural reranker\. In this configuration, it is likely that BGE\-M3 provided an important multilingual semantic signal within the final fused representation, which was subsequently refined through an LLM\-based listwise tournament\. 
Interestingly, for the teams at the top of the benchmark, the differences between the cross-lingual and multilingual results were relatively small, suggesting that these systems were able to generalize effectively to the English–Spanish matching scenario.

Table 6: Overview of team results for Task A cross-lingual setting. Best value per column is in bold, second best is underlined.

#### 4.1.3 Bias Evaluation

In addition to standard retrieval metrics, we performed a bias-oriented analysis by measuring the consistency of the rankings across gendered variants of the same records. Synthetic evaluation data was created from paired examples in which only gender-marked elements, such as names and job titles, were modified, while the remaining profile information was kept unchanged. This setup makes it possible to assess whether the systems preserve similar rankings when equivalent profiles are presented with masculine or feminine variants. Systems with smaller discrepancies between these variants can therefore be considered more robust to this specific type of gender perturbation.

We used Rank-Biased Overlap (RBO) [[67](https://arxiv.org/html/2606.31692#bib.bib76)] to compare the rankings obtained for the gendered variants in the English–English, English–Spanish, and Spanish–Spanish scenarios. The average values are shown in Table [7](https://arxiv.org/html/2606.31692#S4.T7). classum achieved the highest overall RBO score, with an average of 0.9904, obtaining the best results in English–English (0.9931), English–Spanish (0.9892), and Spanish–Spanish (0.9890). hr\_gradient ranked second overall, with an average RBO of 0.9547, and achieved the second-best results in English–English (0.9664) and English–Spanish (0.9489). Meanwhile, HR\_NLP ranked third overall, with an average RBO of 0.9521, and obtained the second-best score in Spanish–Spanish (0.9538).

Table 7: Overview of team results for Task A using RBO metrics.
Best value per column is in bold, second best is underlined.

The high RBO scores obtained by classum indicate that their rankings were highly stable under the gendered perturbations considered in this evaluation. A plausible explanation is that, by extracting task- and skill-oriented profiles, the system reduced the influence of surface-level lexical changes in names or gender-marked job titles. In addition, their use of multiple dense embedding models, LLM-based scoring components, score fusion, and reranking may have further stabilized the final ranking by aggregating complementary signals rather than relying on a single textual representation that might be more affected by gender bias. However, none of the top systems explicitly reports a dedicated bias-mitigation component for this setting, and pretrained embedding models or LLMs may still inherit biases from their underlying data. Therefore, these results should be interpreted as evidence of ranking robustness under the specific masculine/feminine variants used in the benchmark, rather than as a comprehensive demonstration of fairness or bias mitigation.

### 4.2 Task B Main Results

The main Task B leaderboard, shown in Table [8](https://arxiv.org/html/2606.31692#S4.T8), reports the NDCG results using both graded and binary relevance. Rows are ordered by NDCG(graded), the main evaluation metric. classum achieved the best overall performance, with an NDCG(binary) of 0.8340 and an NDCG(graded) of 0.8068, obtaining the highest results in both metrics. NightSun ranked second overall, with an NDCG(binary) of 0.8246 and an NDCG(graded) of 0.7913, also achieving the second-best results in both metrics. bipboopbipboop completed the top three, with an NDCG(binary) of 0.8123 and an NDCG(graded) of 0.7793.

Table 8: Overview of team results for Task B using graded and binary NDCG metrics. Rows are ordered by NDCG(graded), the main evaluation metric.
Best value per metric column is in bold, second best is underlined.

The approaches used by the best-performing systems in Task B show a clear predominance of fine-tuned dense retrieval methods combined with reranking and representation enrichment. classum, the top-ranked system, combined several encoder models, some of them fine-tuned using contrastive learning objectives such as GISTEmbedLoss and CachedGISTEmbedLoss, and then applied a Zerank2 reranker [[49](https://arxiv.org/html/2606.31692#bib.bib79)]. Their system also used prompt engineering for data augmentation and generated enriched job title and skill-concept views, allowing the model to compare short job titles and ESCO skills through more informative representations. NightSun, the second-ranked team, also followed a retrieval-and-reranking strategy, combining a fine-tuned encoder with query augmentation based on hypothetical document embeddings, followed by pointwise and pairwise reranking. bipboopbipboop, ranked third, used a multi-stage pipeline in which job titles were expanded before retrieval, candidates were retrieved using a fine-tuned Qwen3 embedding model, the candidate pool was enriched with JobBERT-v3 as a domain-specialized model, and the final ranking was refined through pointwise and listwise LLM-based reranking.

In a task where the available textual context was relatively limited, embedding fine-tuning and data augmentation techniques seem to have played an important role in improving system performance. In addition, the use of recent neural rerankers and LLM-based reranking strategies, beyond traditional cross-encoder reranking, may have contributed substantially to the performance of the strongest systems.

## 5 Conclusions

TalentCLEF 2026 consolidated the first community evaluation campaign focused on NLP in Human Capital Management.
The second edition attracted substantial participation, with 113 registered participants, 29 teams submitting at least one run to the official benchmarks, and 17 teams contributing system papers. Across the two tasks, the challenge received more than 400 submissions, reflecting the growing interest of the research community in reproducible benchmarks for job, candidate, and skill intelligence. The reuse of previous TalentCLEF resources by several participants, including datasets and participant outputs generated during the first edition, indicates that the initiative is beginning to support research in this domain [[11](https://arxiv.org/html/2606.31692#bib.bib77)]. It has also helped inspire the creation of new resources aimed at advancing research in the area [[9](https://arxiv.org/html/2606.31692#bib.bib14)].

The TalentCLEF 2026 results highlight a clear methodological trend in both tasks: the strongest systems relied on hybrid and modular architectures rather than single-model retrieval. In Task A, contextualized job-person matching was most effectively addressed as a multi-stage retrieval and reranking problem, where semantic representations of job descriptions and resumes were combined with LLM-generated structured views such as extracted skills, tasks, work experience, and, in some cases, graph-derived information. In Task B, job-skill matching followed a similar retrieval-oriented architecture, but with greater emphasis on adapting the embedding space to the task. Fine-tuned encoders trained with contrastive or metric-learning objectives were often combined with query expansion, data augmentation, representation enrichment, fusion, and reranking to distinguish between core and contextual skills.

In both tasks, the best-performing approaches show the value of combining complementary rankings.
Dense encoders provided robust semantic similarity, lexical or domain-specific models captured terminology and labor-market language, and graph-based resources or extracted entity views added information not directly available from surface text. Reranking also played an important role, with both LLM-based and recent neural rerankers refining the top candidates retrieved in earlier stages.

From an evaluation perspective, the systems achieved strong results in all languages and showed different degrees of robustness in bias-oriented evaluation, even though this aspect was not explicitly addressed by most of the participants. For this reason, the next edition of the challenge will place greater emphasis on promoting and extending this type of evaluation.

## References

- [1] S. Anand, J. Decorte, and N. Lowie (2022) Is it required? ranking the skills required for a job-title. CoRR abs/2212.08553. External Links: [Link](https://doi.org/10.48550/arXiv.2212.08553), [Document](https://dx.doi.org/10.48550/ARXIV.2212.08553), 2212.08553. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [2] A. M. Arafan, D. Graus, F. P. Santos, and E. Beauxis-Aussalet (2022) End-to-end bias mitigation in candidate recommender systems with fairness gates. In Proceedings of the 2nd Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2022) co-located with the 16th ACM Conference on Recommender Systems (RecSys 2022), Seattle, USA, 18th-23rd September 2022, M. Kaya, T. Bogers, D. Graus, S. Mesbah, C. Johnson, and F. Gutiérrez (Eds.), CEUR Workshop Proceedings, Vol. 3218. External Links: [Link](https://ceur-ws.org/Vol-3218/RecSysHR2022-paper%5C_6.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [3] P. D. Bao and T. N. H.
Duy (2026) In-Domain Contrastive Fine-tuning of Sentence-BERT for ESCO-based Job–Skill Retrieval and Classification. In CLEF (Working Notes). Cited by: [Table 4](https://arxiv.org/html/2606.31692#S3.T4.15.15.15.6).
- [4] T. Bogers, D. Graus, M. Kaya, C. Johnson, J. Decorte, and T. De Bie (2024) Fourth workshop on recommender systems for human resources (recsys in hr 2024). In Proceedings of the 18th ACM Conference on Recommender Systems, RecSys ’24, New York, NY, USA, pp. 1222–1226. External Links: ISBN 9798400705052, [Link](https://doi.org/10.1145/3640457.3687109), [Document](https://dx.doi.org/10.1145/3640457.3687109). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [5] C. P. Carrino, P. Estrella, R. Zbib, C. Escolano, and J. A. Fonollosa (2026) JobResQA: a benchmark for llm machine reading comprehension on multilingual résumés and jds. arXiv preprint arXiv:2601.23183. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [6] A. J. Castañeira Rodríguez (2026) Semantic Reranking for Zero-Shot Occupation-Skill Linking. In CLEF (Working Notes). Cited by: [Table 4](https://arxiv.org/html/2606.31692#S3.T4.68.68.68.14).
- [7] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu (2024) Bge m3-embedding: multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 4(5). Cited by: [§3.2](https://arxiv.org/html/2606.31692#S3.SS2.p3.1).
- [8] Y. Chen, Q. Liu, Y. Zhang, W. Sun, X. Ma, W. Yang, D. Shi, J. Mao, and D. Yin (2025) Tourrank: utilizing large language models for documents ranking with a tournament-inspired strategy. In Proceedings of the ACM on Web Conference 2025, pp. 1638–1652. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p6.1).
- [9] M. De Lange, W. Veys, F. Retyk, D. Deniz, W. Jouanneau, M. Zhang, A. Bielinski, E. Jouffroy, N. Clobes, N.
Baranowska, et al. (2026) WorkRB: a community-driven evaluation framework for ai in the work domain. arXiv preprint arXiv:2604.13055. Cited by: [§5](https://arxiv.org/html/2606.31692#S5.p1.1).
- [10] A. De Santo, L. Malandri, F. Mercorio, M. Mezzanzanica, and N. Nobani (2026) Skillens: recognising and mapping novel skills from millions of job ads across europe using language models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), pp. 877–885. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [11] J. Decorte, M. De Lange, and J. Van Hautte (2025) Multilingual jobbert for cross-lingual job title matching. arXiv preprint arXiv:2507.21609. Cited by: [§5](https://arxiv.org/html/2606.31692#S5.p1.1).
- [12] J. Decorte, J. V. Hautte, J. Deleu, C. Develder, and T. Demeester (2023) Career path prediction using resume representation learning and skill-based matching. In Proceedings of the 3rd Workshop on Recommender Systems for Human Resources (RecSys in HR 2023) co-located with the 17th ACM Conference on Recommender Systems (RecSys 2023), Singapore, Singapore, 18th-22nd September 2023, M. Kaya, T. Bogers, D. Graus, C. Johnson, and J. Decorte (Eds.), CEUR Workshop Proceedings, Vol. 3490. External Links: [Link](https://ceur-ws.org/Vol-3490/RecSysHR2023-paper%5C_1.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [13] J. Decorte, J. V. Hautte, T. Demeester, and C. Develder (2021) JobBERT: understanding job titles through skills. CoRR abs/2109.09605. External Links: [Link](https://arxiv.org/abs/2109.09605), 2109.09605. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1), [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p3.1).
- [14] J. Decorte, J. Van Hautte, T. Demeester, and C.
Develder (2024) SkillMatch: evaluating self-supervised learning of skill relatedness. arXiv preprint arXiv:2410.05006. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [15] D. Deniz, F. Retyk, L. García-Sardiña, H. Fabregat, L. Gascó, and R. Zbib (2024) Combined unsupervised and contrastive learning for multilingual job recommendation. In Proceedings of the 4th Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2024) co-located with the 18th ACM Conference on Recommender Systems (RecSys 2024), Bari, Italy, 14th-18th October 2024, M. Kaya, T. Bogers, D. Graus, C. Johnson, J. Decorte, and T. D. Bie (Eds.), CEUR Workshop Proceedings, Vol. 3788. External Links: [Link](https://ceur-ws.org/Vol-3788/RecSysHR2024-paper%5C_3.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [16] D. Dumitru (2026) Ranking job-resume similarity using cross-lingual bi-encoding. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.86.86.86.11).
- [17] H. Fabregat, L. García-Sardiña, P. Estrella, L. Gasco, C. P. Carrino, D. Deniz Cerpa, A. Rodrigo, and R. Zbib (2026) Overview of talentclef 2026: task a - contextualized job-person matching. In CLEF (Working Notes). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p6.1).
- [18] H. Fabregat, R. Poves, L. L. Alvarez, F. Retyk, L. García-Sardiña, and R. Zbib (2024) Inductive graph neural network for job-skill framework analysis. Procesamiento del Lenguaje Natural 73, pp. 83–94. External Links: [Link](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6602), ISSN 1989-7553. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [19] L. García-Sardiña, H. Fabregat, D. Deniz, and R. Zbib (2025) Measuring gender bias in job title matching for grammatical gender languages. arXiv preprint arXiv:2509.13803. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [20] L. García-Sardiña, F. Retyk, H.
Fabregat, L. Alvarez Lacasa, R. Poves, and R. Zbib (2023) Normalisation of education information in digitalised recruitment processes. Procesamiento del Lenguaje Natural 71, pp. 63–73. External Links: ISSN 1989-7553. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p2.1).
- [21] L. Gasco, H. Fabregat, L. García-Sardiña, D. Deniz, A. Rodrigo, P. Estrella, and R. Zbib (2025) TalentCLEF at clef2025: skill and job title intelligence for human capital management. In European Conference on Information Retrieval, pp. 479–486. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p5.1).
- [22] L. Gasco, H. Fabregat, L. García-Sardiña, P. Estrella, C. P. Carrino, D. Deniz, A. Rodrigo, and R. Zbib (2026) TalentCLEF at clef2026: skill and job title intelligence for human capital management. In European Conference on Information Retrieval. Cited by: [§2](https://arxiv.org/html/2606.31692#S2.p1.1).
- [23] L. Gasco, H. Fabregat, L. García-Sardiña, P. Estrella, D. Deniz, A. Rodrigo, and R. Zbib (2025) Overview of the talentclef 2025: skill and job title intelligence for human capital management. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 464–485. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p5.1).
- [24] P. Ghosh and V. Sadaphal (2023) JobRecoGPT–explainable job recommendations using llms. arXiv preprint arXiv:2309.11805. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p2.1).
- [25] A. Giabelli, L. Malandri, F. Mercorio, M. Mezzanzanica, and A. Seveso (2021) Skills2Job: A recommender system that encodes job offer embeddings on graph databases. Applied Soft Computing 101, pp. 107049. External Links: [Link](https://doi.org/10.1016/j.asoc.2020.107049), [Document](https://dx.doi.org/10.1016/J.ASOC.2020.107049). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [26] A.
Hamza (2026) Multi-Stage CV Retrieval for TalentCLEF 2026 Task A: A Bi-Encoder and Cross-Encoder Pipeline. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.15.15.15.9).
- [27] L. Hemamou and R. Dupont (2026) Retrieve, Rerank, Survive: LLM-Listwise Pipelines for Job–Person and Job–Skill Matching at TalentCLEF 2026. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.131.131.131.17), [Table 4](https://arxiv.org/html/2606.31692#S3.T4.83.83.83.17).
- [28] E. Hruschka, T. Lake, N. Otani, and T. Mitchell (2024) Proceedings of the first workshop on natural language processing for human resources (nlp4hr 2024). In Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024). External Links: [Link](https://aclanthology.org/2024.nlp4hr-1.0/). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models. ICLR 1(2), pp. 3. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p9.1).
- [30] S. Ibrahim and W. Zaghouani (2026) MARSAD at TalentCLEF 2026 Task B: Hybrid ESCO-Aware Retrieval for Job–Skill Matching. In CLEF (Working Notes). Cited by: [Table 4](https://arxiv.org/html/2606.31692#S3.T4.37.37.37.8).
- [31] O. Khattab and M. Zaharia (2020) Colbert: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p4.1).
- [32] K. Khelkhal and D. Lanasri (2025) Smart-hiring: an explainable end-to-end pipeline for cv information extraction and job matching. arXiv preprint arXiv:2511.02537. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p1.1).
- [33] S. Kim, M.
Choi, and S. Lee (2026) Semantic Enrichment for Resume Ranking and Skill Retrieval at TalentCLEF 2026. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.48.48.48.22), [Table 4](https://arxiv.org/html/2606.31692#S3.T4.31.31.31.18).
- [34] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, et al. (2022) Matryoshka representation learning. Advances in Neural Information Processing Systems 35, pp. 30233–30249. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p9.1).
- [35] C. Lassance, H. Déjean, T. Formal, and S. Clinchant (2024) SPLADE-v3: new baselines for splade. arXiv preprint arXiv:2403.06789. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p4.1).
- [36] D. Lavi, V. Medentsiy, and D. Graus (2021) ConSultantBERT: fine-tuned siamese sentence-bert for matching jobs and job seekers. In Proceedings of the Workshop on Recommender Systems for Human Resources (RecSys in HR 2021) co-located with the 15th ACM Conference on Recommender Systems (RecSys 2021), Amsterdam, The Netherlands, 27th September - 1st October 2021, M. Kaya, T. Bogers, D. Graus, K. Verbert, and F. Gutiérrez (Eds.), CEUR Workshop Proceedings, Vol. 2967. External Links: [Link](https://ceur-ws.org/Vol-2967/paper%5C_8.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [37] M. le Vrang, A. Papantoniou, E. Pauwels, P. Fannes, D. Vandensteen, and J. D. Smedt (2014) ESCO: boosting job matching in europe with semantic interoperability. Computer 47(10), pp. 57–64. External Links: [Link](https://doi.org/10.1109/MC.2014.283), [Document](https://dx.doi.org/10.1109/MC.2014.283). Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p8.1).
- [38] M. Li, X. Lv, J. Zou, T. Chen, C. Zhang, S. An, E. Nie, and G.
Zhou (2025) Query expansion in the age of pre-trained and large language models: a comprehensive survey. ACM Transactions on Information Systems. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p7.1).
- [39] LinkedIn (2025-06) The Skills signal: Unlocking the opportunity in a changing labor market. LinkedIn Corporation. Note: Accessed: 2026-06-03. External Links: [Link](https://economicgraph.linkedin.com/content/dam/me/business/en-us/talent-solutions/resources/pdfs/the-skills-signal-report-2025.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p1.1).
- [40] Y. Liu, Z. Niu, C. Li, and Z. Fan (2026) Pointwise and Pairwise LLM-as-a-Reranker for Job-Title to ESCO Skill Retrieval over GIST-Fine-Tuned JobBERT. In CLEF (Working Notes). Cited by: [Table 4](https://arxiv.org/html/2606.31692#S3.T4.11.11.11.13).
- [41] F. P. Lo, J. Qiu, Z. Wang, H. Yu, Y. Chen, G. Zhang, and B. Lo (2025) AI hiring with llms: a context-aware and explainable multi-agent framework for resume screening. In Proceedings of the computer vision and pattern recognition conference, pp. 4184–4193. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p2.1).
- [42] I. Q. Luthfiyyah, S. Yudhoatmojo, and I. Budi (2026) ui-nlp at TalentCLEF 2026: Multilingual Bi-Encoder Retrieval with Ensemble Strategy for Job-Person and Job-Skill Matching. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.116.116.116.11), [Table 4](https://arxiv.org/html/2606.31692#S3.T4.56.56.56.6).
- [43] M. Moreno, J. C. Martinez-Santos, and E. Puertas (2026) VerbaNex at TalentCLEF 2026: A Hybrid Multilingual Retrieval and Approach for Contextualized Job-Person Matching. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.107.107.107.14).
- [44] A. Nentidis, A. Krithara, G. Paliouras, M. Krallinger, L. G. Sanchez, S. Lima, E. Farre, N. Loukachevitch, V. Davydova, and E.
Tutubalina (2024) BioASQ at clef2024: the twelfth edition of the large-scale biomedical semantic indexing and question answering challenge. In European Conference on Information Retrieval, pp. 490–497. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p4.1).
- [45] K. Nguyen, M. Zhang, S. Montariol, and A. Bosselut (2024) Rethinking skill extraction in the job market domain using large language models. In Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024), pp. 27–42. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [46] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: [§3.2](https://arxiv.org/html/2606.31692#S3.SS2.p4.1).
- [47] A. Palomino (2026) Hybrid Dense-Sparse Retrieval and Re-Ranking for Multilingual Resume Retrieval from Job Descriptions. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.8.8.8.10).
- [48] A. Peña, J. Fierrez, A. Morales, G. Mancera, M. Lopez-Duran, and R. Tolosana (2025) Addressing bias in llms: strategies and application to fair ai-based recruitment. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8, pp. 1976–1987. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [49] N. Pipitone, G. H. Alami, A. Avadhanam, A. Kaminskyi, and A. Khoo (2025) ZELO: elo-inspired training method for rerankers and embedding models. arXiv preprint arXiv:2509.12541. Cited by: [§4.2](https://arxiv.org/html/2606.31692#S4.SS2.p2.1).
- [50] J. Poerwanto and S. Wu (2026) CYUT at TalentCLEF 2026: Zero-Shot Multi-View Retrieval for Job-Person Matching. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.28.28.28.15).
- [51] D. Ranjan, A. S, V. S. K. R. R. Harrison, and L. P. S.
Rao (2026) Fine-tuned Bi-Encoder for Job-Skill Matching: Exploring Curriculum Learning, Hard Negative Mining, and Hybrid Retrieval. In CLEF (Working Notes). Cited by: [Table 4](https://arxiv.org/html/2606.31692#S3.T4.46.46.46.11).
- [52] F. Retyk, L. Gascó, C. P. Carrino, D. Deniz, and R. Zbib (2024) MELO: an evaluation benchmark for multilingual entity linking of occupations. In Proceedings of the 4th Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2024) co-located with the 18th ACM Conference on Recommender Systems (RecSys 2024), Bari, Italy, 14th-18th October 2024, M. Kaya, T. Bogers, D. Graus, C. Johnson, J. Decorte, and T. D. Bie (Eds.), CEUR Workshop Proceedings, Vol. 3788. External Links: [Link](https://ceur-ws.org/Vol-3788/RecSysHR2024-paper%5C_2.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [53] A. Riabi and M. Essayeh (2026) Taxonomy-Aware Hybrid Retrieval with Bounded LLM Reranking for TalentCLEF 2026 Task A. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.61.61.61.15).
- [54] A. Sathish, V. S. Rajkumar, V. Vijay, and C. Kathiravan (2024) The significance of artificial intelligence in career progression and career pathway development. In AI-Oriented Competency Framework for Talent Management in the Digital Economy, pp. 28–41. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [55] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European semantic web conference, pp. 593–607. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p8.1).
- [56] S. S. Sivakaminathan and E. Musi (2026) ChatGPT is a gender bias echo-chamber in hr recruitment: an nlp analysis and framework to uncover the language roots of bias. AI & SOCIETY 41(4), pp. 2841–2861. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [57] A. V.
Solatorio (2024) Gistembed: guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p9.1), [§3.2](https://arxiv.org/html/2606.31692#S3.SS2.p4.1).
- [58] E. Thuma and G. Anderson (2026) A Union Fusion of PyTerrier, Doc2Query, BM25 and Multilingual Embeddings for Job Title Matching. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.95.95.95.11).
- [59] H. Tran (2026) QTPride @TalentCLEF 2026: Towards Multi-View Job-Person Matching with LLM Parsing and Knowledge Graph-Enhanced Retrieval. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.138.138.138.9).
- [60] S. Vaishampayan, H. Leary, Y. B. Alebachew, L. Hickman, B. Stevenor, W. Beck, and C. Brown (2025) Human and llm-based resume matching: an observational study. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 4823–4838. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p2.1).
- [61] N. Vermeer, V. Provatorova, D. Graus, T. Rajapakse, and S. Mesbah (2022) Using robbert and extreme multi-label classification to extract implicit and explicit skills from dutch job descriptions. Compjobs’ 22: Computational Jobs Marketplace 1(1), pp. 2–6. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [62] W. Veys, L. Gasco, M. De Lange, and J. Decorte (2026) Overview of talentclef 2026: task b - job-skill matching with skill type classification. In CLEF (Working Notes). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p6.1).
- [63] M. Vielvoye (2026) Skillberg at TalentCLEF 2026: Cross-Framework Knowledge-Graph Grounded Multilingual Encoders for Job–Person and Job–Skill Matching. In CLEF (Working Notes). Cited by: [Table 3](https://arxiv.org/html/2606.31692#S3.T3.77.77.77.18), [Table 4](https://arxiv.org/html/2606.31692#S3.T4.52.52.52.8).
- [64] V.
Vijayalakshmi, A. Ananya, and M. Sharanya Avadhani (2024) Optimization of hr recruitment process using large language model (llm). In 2024 First International Conference on Innovations in Communications, Electrical and Computer Engineering (ICICEC), pp. 1–5. Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [65] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei (2024) Multilingual e5 text embeddings: a technical report. arXiv preprint arXiv:2402.05672. Cited by: [§3.2](https://arxiv.org/html/2606.31692#S3.SS2.p3.1).
- [66] X. Wang, Y. Chen, and W. Zhu (2021) A survey on curriculum learning. IEEE transactions on pattern analysis and machine intelligence 44(9), pp. 4555–4576. Cited by: [§3.1](https://arxiv.org/html/2606.31692#S3.SS1.p9.1).
- [67] W. Webber, A. Moffat, and J. Zobel (2010) A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS) 28(4), pp. 1–38. Cited by: [§4.1.3](https://arxiv.org/html/2606.31692#S4.SS1.SSS3.p2.1).
- [68] R. Zbib, L. L. Alvarez, F. Retyk, R. Poves, J. Aizpuru, H. Fabregat, V. Simkus, and E. G. Casademont (2022) Learning job titles similarity from noisy skill labels. CoRR abs/2207.00494. External Links: [Link](https://doi.org/10.48550/arXiv.2207.00494), [Document](https://dx.doi.org/10.48550/ARXIV.2207.00494), 2207.00494. Cited by: [§2.1](https://arxiv.org/html/2606.31692#S2.SS1.p2.1).
- [69] M. Zhang, K. N. Jensen, S. D. Sonniks, and B. Plank (2022) SkillSpan: hard and soft skill extraction from english job postings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz (Eds.), pp.
4962–4984. External Links: [Link](https://doi.org/10.18653/v1/2022.naacl-main.366), [Document](https://dx.doi.org/10.18653/V1/2022.NAACL-MAIN.366). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [70] M. Zhang, K. N. Jensen, R. van der Goot, and B. Plank (2022) Skill extraction from job postings using weak supervision. In Proceedings of the 2nd Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2022) co-located with the 16th ACM Conference on Recommender Systems (RecSys 2022), Seattle, USA, 18th-23rd September 2022, M. Kaya, T. Bogers, D. Graus, S. Mesbah, C. Johnson, and F. Gutiérrez (Eds.), CEUR Workshop Proceedings, Vol. 3218. External Links: [Link](https://ceur-ws.org/Vol-3218/RecSysHR2022-paper%5C_10.pdf). Cited by: [§1](https://arxiv.org/html/2606.31692#S1.p3.1).
- [71] Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, et al. (2025) Qwen3 embedding: advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176. Cited by: [§3.2](https://arxiv.org/html/2606.31692#S3.SS2.p3.1).