Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis
Summary
This paper introduces Multi-Adapter PPO, a reinforcement learning framework with cross-attention for wavelength selection in LIBS quantitative analysis, achieving 28.4% better composite scores and 45.2% improvement in prediction accuracy over traditional methods on steel and coal datasets.
View Cached Full Text
Cached at: 06/17/26, 05:39 AM
# Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis
Source: [https://arxiv.org/html/2606.17476](https://arxiv.org/html/2606.17476)
###### Abstract
Laser\-induced breakdown spectroscopy \(LIBS\) quantitative analysis faces critical challenges in wavelength selection due to high\-dimensional spectral data and the fundamental trade\-off between prediction accuracy and feature efficiency\. This paper presents a novel Multi\-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross\-attention mechanisms and multiple specialized adapters to capture complex spectral relationships\. Our approach outperforms traditional Particle Swarm Optimization \(PSO\) by an average of 28\.4% in comprehensive score and 45\.2% in prediction accuracy across steel and coal datasets\. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state\-of\-the\-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency\. We released our code and dataset here: https://github\.com/Hflying/MAPPO
## IIntroduction
Laser\-induced breakdown spectroscopy \(LIBS\) has established itself as a versatile analytical technique for elemental detection and quantification, with applications spanning environmental monitoring, materials science, and industrial quality control\. High\-accuracy quantitative measurement via LIBS mainly deal with two critical challenges, spectral noise\[[28](https://arxiv.org/html/2606.17476#bib.bib1),[7](https://arxiv.org/html/2606.17476#bib.bib2),[20](https://arxiv.org/html/2606.17476#bib.bib3),[3](https://arxiv.org/html/2606.17476#bib.bib4),[9](https://arxiv.org/html/2606.17476#bib.bib5),[8](https://arxiv.org/html/2606.17476#bib.bib12),[6](https://arxiv.org/html/2606.17476#bib.bib13),[19](https://arxiv.org/html/2606.17476#bib.bib14)\]and data scarcity\[[17](https://arxiv.org/html/2606.17476#bib.bib20),[4](https://arxiv.org/html/2606.17476#bib.bib21),[24](https://arxiv.org/html/2606.17476#bib.bib22)\]\. In LIBS, each element emits characteristic spectral peaks at specific wavelengths, which serve as the cornerstone for both qualitative identification and quantitative measurement\. However, extracting reliable relationships between these spectral features and true elemental concentrations is complicated by matrix effects, overlapping peaks, and non\-linear correlations—factors that traditional data processing pipelines often struggle to address comprehensively\. Dimensionality reduction methods, such as principal component analysis \(PCA\)\[[22](https://arxiv.org/html/2606.17476#bib.bib15),[29](https://arxiv.org/html/2606.17476#bib.bib16),[15](https://arxiv.org/html/2606.17476#bib.bib17),[14](https://arxiv.org/html/2606.17476#bib.bib6)\], while effective in simplifying data complexity, often obscure the direct association between individual characteristic peaks and their corresponding elemental concentrations by transforming the original spectral space\. The physics\-driven interpretation of spectral features\[[1](https://arxiv.org/html/2606.17476#bib.bib23),[23](https://arxiv.org/html/2606.17476#bib.bib19)\]and their relationship to elemental composition provides not only theoretical validation but also insights for improving the robustness and interpretability of quantitative models\. Conversely, conventional wavelength selection algorithms, despite their focus on retaining informative wavelengths, frequently underperform in LIBS due to their inability to adapt to the dynamic and complex nature of spectral data, as highlighted by limitations in rigid thresholding or linear feature weighting approaches\.
Therefore, a critical challenge in LIBS quantitative analysis lies in wavelength selection, as raw spectral data is inherently high\-dimensional, containing redundant information, background noise, and spectral interferences that hinder accurate concentration prediction\. Variable selection methods have been proposed including Mutual Information\[[21](https://arxiv.org/html/2606.17476#bib.bib24)\], Chi\-squared test, and Information Gain \(IG\), \(they evaluate feature relevance based on statistical measures and are computationally efficient\) and Particle Swarm Optimization\[[27](https://arxiv.org/html/2606.17476#bib.bib25)\], Randomization Test\[[26](https://arxiv.org/html/2606.17476#bib.bib26)\], and Genetic Algorithm\[[16](https://arxiv.org/html/2606.17476#bib.bib27)\]\(they consider the relationship between feature subsets and learning models\)\. Particle swarm optimization \(PSO\)\[[5](https://arxiv.org/html/2606.17476#bib.bib18),[23](https://arxiv.org/html/2606.17476#bib.bib19)\], a widely adopted swarm intelligence algorithm, has been widely employed for wavelength selection in LIBS, leveraging collective search dynamics to identify optimal feature subsets\. However, traditional PSO and its hybride variants\[[30](https://arxiv.org/html/2606.17476#bib.bib28),[18](https://arxiv.org/html/2606.17476#bib.bib29),[13](https://arxiv.org/html/2606.17476#bib.bib9),[10](https://arxiv.org/html/2606.17476#bib.bib7),[12](https://arxiv.org/html/2606.17476#bib.bib10),[11](https://arxiv.org/html/2606.17476#bib.bib11),[25](https://arxiv.org/html/2606.17476#bib.bib8)\]suffer from limitations such as premature convergence and suboptimal precision when navigating the high\-dimensional and noisy landscape of LIBS spectra, where subtle but critical peaks \(e\.g\., those from trace elements\) are easily overshadowed by stronger signals or noise\.
Figure 1:Overview of Multi\-Adapter PPO Architecture\. The framework consists of dual encoders \(feature and target\) that process spectral data and target variables, followed by multi\-head cross\-attention to capture spectral–target relationships\. Four specialized adapters learn diverse feature\-target mapping patterns, which are aggregated through learnable weights\. The final policy network outputs action probabilities for wavelength selection or stopping\.If we delve into the essence of wavelength selection in LIBS spectral analysis, it inherently involves sequential decision\-making—where each choice of wavelength affects subsequent selections—and requires optimizing a specific objective \(e\.g\., enhancing signal informativeness while reducing noise\), which aligns perfectly with the core paradigm of reinforcement learning, where an agent learns to make sequential decisions to maximize cumulative rewards\. A similar insight has been applied in the hyperspectral band selection domain\. For instance, in\[[2](https://arxiv.org/html/2606.17476#bib.bib30)\], the authors transform the hyperspectral band selection task into a reinforcement learning problem, proposing an A2C\-based algorithm and leveraging a semi\-supervised EvaluateNet to assess the efficiency of selected bands\. This work validates the feasibility of framing spectral selection tasks within a reinforcement learning framework, laying a foundation for further algorithmic optimizations\. However, it is widely recognized that the A2C algorithm has limitations in terms of generality compared to more advanced alternatives\. In contrast, Proximal Policy Optimization \(PPO\) offers distinct advantages in handling sequential decision\-making and complex optimization tasks\. By integrating an actor\-critic framework with efficient, clipped policy updates, PPO can stably and adaptively prioritize informative wavelengths in LIBS spectral analysis while suppressing noise and interference\. This not only addresses the core demands of wavelength selection but also overcomes the generality constraints of A2C, making it far better suited for the nuanced and variable requirements of LIBS spectral analysis\. they transform the hyperspectral band selection task into a reinforcement learning problem and propose a A2C\-Based algorithm and use a semisupervised EvaluateNet to evaluate band effiency\.
This study addresses these limitations by first establishing the theoretical advantages of Proximal Policy Optimization \(PPO\) over traditional Particle Swarm Optimization \(PSO\) for sequential decision\-making problems\. While PSO suffers from premature convergence and lacks learning mechanisms in high\-dimensional spaces, PPO naturally handles sequential wavelength selection through its Actor\-Critic architecture and policy gradient optimization\. Building on this foundation, we propose and compare multiple enhanced PPO variants for wavelength selection\. Our key innovation lies in developing a comprehensive framework that evaluates different PPO architectures and their effectiveness in balancing prediction accuracy with feature efficiency\. the proposed methods are validated using a LIBS dataset collected from a custom\-built system optimized for coal and steel quality analysis, featuring high\-resolution spectral data \(180–800 nm wavelength range, 0\.1 nm resolution\) acquired from pulsed laser\-induced plasma emissions, with measurements spanning multiple elemental concentrations to reflect real\-world variability\.
Overall, our contribution summarizes as follows:
- •This work is the first work to model wavelength selection as a reinforcement learning process and theoretically demonstrate the superiority of PPO algorithms over PSO algorithms in this context\.
- •This work comprehensively compares multiple PPO deep network variants, all of which outperform PSO algorithms, with the best algorithm\(multi\-adapter PPO algorithm\) achieving a maximum 45\.2% accuracy improvement while maintaining the same number of features\.
- •This work develops and comprehensively evaluates multiple PPO deep evaluation network variants, analyzing the scenario\-specific applicability of each algorithm\.
- •Another significant contribution is the open\-sourced coal and steel LIBS datasets with true label acquired from pulsed laser\-induced plasma emissions\.
## IIProblem Formulation
### II\-AProblem Definition
Given a LIBS datasetD=\{\(xi,yi\)\}i=1nD=\\\{\(x\_\{i\},y\_\{i\}\)\\\}\_\{i=1\}^\{n\}wherexi∈ℝdx\_\{i\}\\in\\mathbb\{R\}^\{d\}represents the spectral intensity atddwavelengths andyi∈ℝy\_\{i\}\\in\\mathbb\{R\}represents the elemental concentration, the wavelength selection problem aims to find a subsetS⊆\{1,2,…,d\}S\\subseteq\\\{1,2,\\ldots,d\\\}that maximizes the following objective function:
J\(S\)=P\(S\)−α⋅\|S\|dJ\(S\)=\\text\{P\}\(S\)\-\\alpha\\cdot\\frac\{\|S\|\}\{d\}\(1\)
where:
- •P\(S\)\\text\{P\}\(S\)measures prediction accuracy using wavelengths inSS\(e\.g\., negative RMSE orR2R^\{2\}score\)
- •\|S\|d\\frac\{\|S\|\}\{d\}is the feature selection ratio, penalizing excessive wavelength selection
- •α\\alphais a trade\-off parameter controlling the balance between performance and sparsity
This simple objective function provides a clear optimization target for the learning algorithm, balancing prediction accuracy with feature efficiency\. For comprehensive performance evaluation, we also compute the Pareto ScoreJPareto\(S\)=ComprehensiveScore\(S\)⋅EfficiencyScore\(S\)J\_\{Pareto\}\(S\)=\\text\{ComprehensiveScore\}\(S\)\\cdot\\text\{EfficiencyScore\}\(S\)which incorporates multiple performance metrics and feature quality measures\.
### II\-BPPO Algorithm Framework
Proximal Policy Optimization \(PPO\) represents a state\-of\-the\-art reinforcement learning algorithm that excels in sequential decision\-making tasks\. Unlike traditional optimization methods that suffer from premature convergence and lack of learning mechanisms, PPO employs an Actor\-Critic architecture that naturally handles the sequential nature of wavelength selection\.
Actor Network \(Policy\): The policy networkπθ\(at\|st\)\\pi\_\{\\theta\}\(a\_\{t\}\|s\_\{t\}\)outputs action probabilities for selecting wavelengths or stopping, parameterized byθ\\theta\. It learns to map states to optimal action distributions through policy gradient updates\.
Critic Network \(Value\): The value networkVϕ\(st\)V\_\{\\phi\}\(s\_\{t\}\)estimates the expected return from statests\_\{t\}, providing baseline values for reducing variance in policy updates and enabling more stable learning\.
### II\-CEnhanced PPO Variants
Building upon the standard PPO framework, we develop multiple enhanced variants by modifying the Actor Network \(Policy\) to address specific challenges in wavelength selection, while maintaining the standard Critic Network \(Value\) architecture\. These variants can be categorized into three main groups:
Mutual Information\-Based Tricksincorporate mutual information theory to guide feature selection toward more informative wavelengths\. MI\-Regularized PPO modifies the policy network to incorporate mutual information constraints between selected wavelengths and target concentrations, guiding the agent to prioritize wavelengths with high informativeness while penalizing redundant selections through enhanced policy gradients\. The mutual information termI\(W;Y\)I\(W;Y\)between selected wavelengthsWWand target variableYYis computed and added as a regularization term in the policy loss, encouraging the agent to select wavelengths that maximize information gain\. Improved MI\-PPO extends this approach with explicit target feature number constraints, enabling more precise control over the final feature subset size through constrained policy optimization that balances information gain with sparsity requirements\.
Early Stopping Tricksimplement optimal stopping theory to determine the most appropriate termination point for wavelength selection\. Optimal Stopping PPO enhances the policy network with a learned stopping criterion that balances exploration \(selecting more wavelengths to potentially improve accuracy\) and exploitation \(stopping early to maintain feature efficiency\)\. The agent learns to predict the marginal benefit of selecting additional wavelengths and decides when further selection yields diminishing returns\. This is achieved by incorporating a patience mechanism and tracking validation performance, where the policy network learns to output a stop action when the expected improvement from additional wavelengths falls below a learned threshold\.
Advanced Policy Network Tricksreplace or enhance the standard MLP policy network with sophisticated architectures to capture complex spectral relationships\. Multi\-Adapter PPO, our best\-performing variant, replaces the standard policy network with cross\-attention mechanisms and multiple specialized adapters to capture diverse spectral–target relationships and improve feature selection accuracy through enhanced policy representation\. Transformer\-PPO substitutes the traditional MLP policy with a Transformer architecture, leveraging self\-attention mechanisms to better model long\-range dependencies in spectral data and capture global relationships between wavelengths\. CLIP\-PPO adopts a CLIP\-inspired dual encoder architecture with separate feature and target encoders in the policy network, enabling better understanding of spectral feature relationships through contrastive representation learning\. ICL\-PPO implements in\-context learning capabilities within the policy network, allowing the agent to adapt its feature selection strategy based on contextual information from similar spectral patterns observed during training\.
### II\-DMDP Formulation
We reformulate the wavelength selection problem as a Markov Decision Process \(MDP\) to leverage PPO’s sequential decision\-making capabilities:
State Space: At time steptt, the statests\_\{t\}is defined as:
st=\[𝐦t,𝐟t,𝐜t\]s\_\{t\}=\[\\mathbf\{m\}\_\{t\},\\mathbf\{f\}\_\{t\},\\mathbf\{c\}\_\{t\}\]\(2\)where𝐦t∈\{0,1\}d\\mathbf\{m\}\_\{t\}\\in\\\{0,1\\\}^\{d\}indicates selected wavelengths,𝐟t∈ℝd\\mathbf\{f\}\_\{t\}\\in\\mathbb\{R\}^\{d\}represents current feature importance scores, and𝐜t∈ℝk\\mathbf\{c\}\_\{t\}\\in\\mathbb\{R\}^\{k\}encodes contextual information about spectral characteristics\.
Action Space: The agent can either select a new wavelengthat∈\{1,2,…,d\}a\_\{t\}\\in\\\{1,2,\\ldots,d\\\}or stop the selection processat=stopa\_\{t\}=\\text\{stop\}\.
Transition Function: The state transitions deterministically based on the selected action, updating the feature mask and recalculating mutual information metrics\.
Multi\-Adapter PPO Reward Function: Unlike classical PPO variants that use complex reward functions with mutual information and redundancy penalties, our Multi\-Adapter PPO employs a simplified yet highly effective reward function:
R\(st,at\)=\{P\(St\)−α⋅\|St\|difat=stop0\.001ifatselects new wavelength−0\.01ifatselects existing wavelengthR\(s\_\{t\},a\_\{t\}\)=\\begin\{cases\}P\(S\_\{t\}\)\-\\alpha\\cdot\\frac\{\|S\_\{t\}\|\}\{d\}&\\text\{if \}a\_\{t\}=\\text\{stop\}\\\\ 0\.001&\\text\{if \}a\_\{t\}\\text\{ selects new wavelength\}\\\\ \-0\.01&\\text\{if \}a\_\{t\}\\text\{ selects existing wavelength\}\\end\{cases\}\(3\)
whereα=2\.0\\alpha=2\.0controls the sparsity penalty\.
Cross\-Attention Mechanism: Unlike classical PPO’s simple MLP policy network, Multi\-Adapter PPO employs multi\-head cross\-attention to capture complex spectral relationships\. The cross\-attention between feature encoder output𝐅∈ℝd×dmodel\\mathbf\{F\}\\in\\mathbb\{R\}^\{d\\times d\_\{model\}\}and target encoder output𝐓∈ℝ1×dmodel\\mathbf\{T\}\\in\\mathbb\{R\}^\{1\\times d\_\{model\}\}is computed as:
Attention\(𝐅,𝐓\)=softmax\(𝐅𝐖Q\(𝐓𝐖K\)Tdmodel\)𝐓𝐖V\\text\{Attention\}\(\\mathbf\{F\},\\mathbf\{T\}\)=\\text\{softmax\}\\left\(\\frac\{\\mathbf\{F\}\\mathbf\{W\}\_\{Q\}\(\\mathbf\{T\}\\mathbf\{W\}\_\{K\}\)^\{T\}\}\{\\sqrt\{d\_\{model\}\}\}\\right\)\\mathbf\{T\}\\mathbf\{W\}\_\{V\}\(4\)
where𝐖Q\\mathbf\{W\}\_\{Q\},𝐖K\\mathbf\{W\}\_\{K\},𝐖V\\mathbf\{W\}\_\{V\}are learnable query, key, and value matrices, anddmodel=128d\_\{model\}=128is the model dimension\.
Multi\-Adapters: Four specialized adapters\{Ai\}i=14\\\{A\_\{i\}\\\}\_\{i=1\}^\{4\}process the cross\-attention output with learnable weights\{αi\}i=14\\\{\\alpha\_\{i\}\\\}\_\{i=1\}^\{4\}:
𝐡adapter=∑i=14αi⋅Ai\(Attention\(𝐅,𝐓\)\)\\mathbf\{h\}\_\{adapter\}=\\sum\_\{i=1\}^\{4\}\\alpha\_\{i\}\\cdot A\_\{i\}\(\\text\{Attention\}\(\\mathbf\{F\},\\mathbf\{T\}\)\)\(5\)
whereαi=softmax\(𝐰i\)\\alpha\_\{i\}=\\text\{softmax\}\(\\mathbf\{w\}\_\{i\}\)and𝐰i\\mathbf\{w\}\_\{i\}are learnable parameters\.
Policy Network: The final action probabilities are computed as:
π\(a\|s\)=softmax\(𝐖actor⋅𝐡adapter\+𝐛actor\)\\pi\(a\|s\)=\\text\{softmax\}\(\\mathbf\{W\}\_\{actor\}\\cdot\\mathbf\{h\}\_\{adapter\}\+\\mathbf\{b\}\_\{actor\}\)\(6\)
### II\-EPolicy Update Mechanism
The PPO algorithm updates both policy and value networks using the following mechanism:
Policy Loss: Using the clipped surrogate objective:
LCLIP\(θ\)=𝔼t\[min\(rt\(θ\)At,clip\(rt\(θ\),1−ϵ,1\+ϵ\)At\)\]L^\{CLIP\}\(\\theta\)=\\mathbb\{E\}\_\{t\}\\left\[\\min\(r\_\{t\}\(\\theta\)A\_\{t\},\\text\{clip\}\(r\_\{t\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\)A\_\{t\}\)\\right\]\(7\)wherert\(θ\)=πθ\(at\|st\)πθold\(at\|st\)r\_\{t\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\(a\_\{t\}\|s\_\{t\}\)\}\{\\pi\_\{\\theta\_\{old\}\}\(a\_\{t\}\|s\_\{t\}\)\}is the probability ratio andAtA\_\{t\}is the advantage function\.
Value Loss: Mean squared error between predicted and actual returns:
LVF\(ϕ\)=𝔼t\[\(Vϕ\(st\)−Rt\)2\]L^\{VF\}\(\\phi\)=\\mathbb\{E\}\_\{t\}\\left\[\(V\_\{\\phi\}\(s\_\{t\}\)\-R\_\{t\}\)^\{2\}\\right\]\(8\)
Total Loss: For Multi\-Adapter PPO, the combined optimization objective includes additional regularization terms:
LTOTAL=LCLIP\(θ\)\+0\.5⋅LVF\(ϕ\)\+λadapter⋅‖𝐰adapter‖2L^\{TOTAL\}=L^\{CLIP\}\(\\theta\)\+0\.5\\cdot L^\{VF\}\(\\phi\)\+\\lambda\_\{adapter\}\\cdot\\\|\\mathbf\{w\}\_\{adapter\}\\\|\_\{2\}\(9\)whereλadapter=0\.01\\lambda\_\{adapter\}=0\.01is the adapter regularization weight, and𝐰adapter\\mathbf\{w\}\_\{adapter\}represents the learnable adapter weights\. This additional regularization term prevents overfitting of the multiple adapters and ensures balanced learning across different feature\-target mapping patterns\.
### II\-FRegret Bound Analysis
For Multi\-Adapter PPO with learning rateη=1T\\eta=\\frac\{1\}\{\\sqrt\{T\}\}, the expected regret is bounded by:
𝔼\[Regret\(T\)\]≤O\(d⋅\|A\|⋅T⋅logT\)\\mathbb\{E\}\[\\text\{Regret\}\(T\)\]\\leq O\\left\(\\sqrt\{d\\cdot\|A\|\\cdot T\\cdot\\log T\}\\right\)\(10\)whereddis the feature dimension,\|A\|\|A\|is the action space size\.
Proof\.Under bounded rewards\|R\(s,a\)\|≤Rmax\|R\(s,a\)\|\\leq R\_\{\\max\}and bounded advantage, the policy gradient estimator has varianceO\(d⋅\|A\|\)O\(d\\cdot\|A\|\)per step\. Withη≍1/T\\eta\\asymp 1/\\sqrt\{T\}, the accumulated deviation of the policy from the optimal \(in expectation\) is controlled by a martingale concentration argument, yielding theTlogT\\sqrt\{T\\log T\}term; thed⋅\|A\|\\sqrt\{d\\cdot\|A\|\}factor comes from the variance bound and the policy class dimension\. Thus the expected cumulative loss relative toJ\(S∗\)J\(S^\{\*\}\)isO\(d⋅\|A\|⋅T⋅logT\)O\(\\sqrt\{d\\cdot\|A\|\\cdot T\\cdot\\log T\}\)\.
TABLE I:Key HyperparametersMethodParameterValuePPOepisodesγ\\gammaϵ\\epsilonα\\alphaagent LRevaluator LRbatch size30–500\.990\.22\.010−310^\{\-3\}10−410^\{\-4\}32Multi\-Adapter / Transformerdmodeld\_\{\\text\{model\}\}headsadapterslayers128843PSO / MI\-PSOparticlesmax iterwwc1c\_\{1\}c2c\_\{2\}MI thresholdCV folds20500\.71\.51\.50\.33Early stoppingpatienceminΔ\\Delta1010−410^\{\-4\}
## IIIExperiments and analysis
### III\-ADataset and Experimental Setup
TABLE II:Algorithm Performance Comparison \(Steel / Coal\)AlgorithmComprehensive Score↑\\uparrowMulti\-Objective Score↑\\uparrowTestR2R^\{2\}↑\\uparrowRMAE↓\\downarrowMulti\-Adapter PPO0\.9861/0\.58480\.9722/ 0\.11080\.7821/0\.09760\.0286 / 0\.2029Early Stopping MI\-PPO0\.9674 / 0\.52880\.9347 / 0\.2706\-2\.7889 / \-1\.42560\.1185 / 0\.2676Transformer\-PPO0\.9569 / 0\.49550\.9139 / 0\.0200\-1\.0529 / \-0\.02760\.0861 / 0\.1861Standard PPO0\.9528 / 0\.51200\.9056 / 0\.10310\.7505 / 0\.06550\.0277/0\.1797PSO0\.7684 / 0\.51740\.5418 /0\.2815\-0\.0580 / \-13\.72240\.0510 / 0\.6312The dataset was collected using a custom\-built LIBS system designed for coal quality analysis on a transport belt, as illustrated in Figure[2](https://arxiv.org/html/2606.17476#S3.F2)\. The experimental setup consists of several key components working in concert to achieve high\-precision spectral data acquisition:
The system employs a 1064 nm pulsed Nd:YAG laser \(Quantel Brilliant B\) with a pulse energy of 100 mJ, pulse width of 8 ns, and repetition rate of 10 Hz\. The laser beam is directed through a two\-way dichroic mirror and focused onto the coal surface using a plano\-convex lens \(focal length: 100 mm\), creating a laser spot diameter of approximately 500μ\\mum\. The resulting plasma emission is collected by the same focusing lens, reflected by the dichroic mirror, and directed through a second focusing lens to an optical fiber interface\. The collected light is transmitted via a 600μm\\mu mcore diameter optical fiber to an Echelle spectrometer \(Andor Mechelle ME5000\) with a spectral range of 180\-800 nm and resolution of 0\.1 nm\.
We evaluate our PPO\-based wavelength selection framework using a comprehensive LIBS dataset collected from coal and steel quality analysis\. The steel dataset contains 21 samples with spectral measurements across 720 wavelengths and value labels of 21 trace elements\. The coal dataset contains 84 samples with spectral measurements across 101 wavelengths and value labels of 6 trace elements\. For the steel dataset, the raw spectral data with spatial dimensions \(pixels\) are first averaged to obtain a 720\-dimensional spectral vector per sample; the train/test split follows an 80/20 ratio with stratified sampling based on concentration bins to maintain consistent distribution\.
Figure 2:Schematic diagram of the LIBS system for coal qual ity analysis on a transport belt\. The system consists of a 1064 nm Nd:YAG laser, two\-way dichroic mirror, focusing lenses, optical fiber interface, Echelle spectrometer, delay generator, and computer for data acquisition and processing\. The laser beam \(red dashed line\) is focused onto the coal sample, gen erating plasma emission \(yellow solid line\) that is collected and analyzed to determine elemental composition\.All experiments were conducted on a workstation equipped with the following hardware and software specifications:
- •GPU: NVIDIA GeForce RTX 4090 Laptop GPU with 16 GB VRAM;
- •CPU: Intel Core i9\-13980HX Processor, 24 cores, 64 GB DDR5;
- •Software environments: CUDA Version 12\.1,Python Version 3\.10,PyTorch Version 2\.0\.1;
Hyperparameters\.Key hyperparameters used for reproducibility are summarized in Table[I](https://arxiv.org/html/2606.17476#S2.T1)\. For PPO\-based methods, we use discount factorγ=0\.99\\gamma=0\.99, clip rangeϵ=0\.2\\epsilon=0\.2, and sparsity penaltyα=2\.0\\alpha=2\.0in the reward; the agent and evaluator learning rates are set to1×10−31\\times 10^\{\-3\}and1×10−41\\times 10^\{\-4\}respectively, with batch size 32 and 5 PPO update steps per episode\. Multi\-Adapter PPO and Transformer\-PPO use model dimensiondmodel=128d\_\{\\text\{model\}\}=128, 8 attention heads, and \(for Transformer\-PPO\) 3 layers; Multi\-Adapter PPO uses 4 adapters\. For PSO/MI\-PSO, we use 20 particles, 50 iterations, inertia weightw=0\.7w=0\.7, acceleration coefficientsc1=c2=1\.5c\_\{1\}=c\_\{2\}=1\.5, and MI threshold 0\.3; the downstream regressor is linear regression with 3\-fold cross\-validation for evaluation\. Early stopping variants use patience 10 and minimum delta10−410^\{\-4\}\. All experiments use fixed random seeds \(e\.g\., 42\) and deterministic CuDNN where applicable to ensure reproducibility\.
### III\-BPerformance Comparison
Table[II](https://arxiv.org/html/2606.17476#S3.T2)presents the comprehensive performance rankings across all algorithms on both the Steel and Coal LIBS datasets\. Multi\-Adapter PPO consistently achieves the best or competitive performance in comprehensive score, multi\-objective score, and testR2R^\{2\}, demonstrating the effectiveness of the cross\-attention and multi\-adapter design for wavelength selection\.
Evaluation metrics\.The comprehensive scoreScS\_\{c\}balances prediction accuracy, feature efficiency, and model performance through the following formula:
Sc=\(1−Rn\)⋅wr\+\(1−Fn\)⋅wf\+Rn2⋅wr2S\_\{c\}=\(1\-R\_\{n\}\)\\cdot w\_\{r\}\+\(1\-F\_\{n\}\)\\cdot w\_\{f\}\+R^\{2\}\_\{n\}\\cdot w\_\{r2\}\(11\)whereScS\_\{c\}is the comprehensive score,RnR\_\{n\}is the normalized RMSE,FnF\_\{n\}is the normalized feature ratio, andRn2R^\{2\}\_\{n\}is the normalizedR2R^\{2\}score\. The multi\-objective scoreSmS\_\{m\}evaluates the degree to which predefined targets \(e\.g\., RMSE and feature ratio\) are met:
Sm=max\(0,1−RRt\)\+max\(0,1−FnFt\)S\_\{m\}=\\max\\left\(0,1\-\\frac\{R\}\{R\_\{t\}\}\\right\)\+\\max\\left\(0,1\-\\frac\{F\_\{n\}\}\{F\_\{t\}\}\\right\)\(12\)whereRRis the RMSE,FnF\_\{n\}is the normalized feature ratio, andRtR\_\{t\},FtF\_\{t\}are the target values\. HigherScS\_\{c\}andSmS\_\{m\}indicate better overall trade\-off between accuracy and sparsity; testR2R^\{2\}and RMAE \(relative mean absolute error\) reflect direct prediction quality\.
Ranking and algorithm comparison\.On the Steel dataset, Multi\-Adapter PPO attains the highest comprehensive score \(0\.9861\) and multi\-objective score \(0\.9722\), as well as the best testR2R^\{2\}\(0\.7821\), with RMAE 0\.0286\. Early Stopping MI\-PPO and Transformer\-PPO follow in comprehensive score \(0\.9674 and 0\.9569\), while Standard PPO achieves the lowest RMAE \(0\.0277\), indicating that different PPO variants excel in different aspects\. On the Coal dataset, Multi\-Adapter PPO again leads in comprehensive score \(0\.5848\) and testR2R^\{2\}\(0\.0976\); PSO reaches the highest multi\-objective score \(0\.2815\) and Standard PPO the best RMAE \(0\.1797\), but PSO exhibits poor and unstable testR2R^\{2\}\(−13\.7224\-13\.7224\), which limits its practical use\. Across both datasets, Multi\-Adapter PPO emerges as the most robust and best\-performing method overall\.
PPO variants vs\. PSO\.PPO\-based algorithms collectively outperform traditional PSO by an average of 28\.4% in comprehensive score and 79\.6% in multi\-objective score\. This gap is explained by PSO’s tendency toward premature convergence and lack of explicit sequential decision\-making: it searches in a fixed\-dimensional space without modeling the dependency between wavelength choices\. In contrast, the PPO variants treat wavelength selection as a sequential MDP and learn a policy that adapts to spectral structure, leading to better accuracy–efficiency trade\-offs\. Among the PPO variants, Multi\-Adapter PPO benefits from cross\-attention and multiple adapters to capture diverse spectral–target relationships, which contributes to its leading position in the comparison\.
### III\-CPrediction Accuracy
In terms of test set prediction accuracy, Standard PPO achieves the bestR2R^\{2\}performance among all algorithms on the Steel dataset, while PPO\-based algorithms collectively outperform PSO by an average of 12\.8% inR2R^\{2\}score and 45\.2% in RMAE across both datasets\. Multi\-Adapter PPO demonstrates superior performance through its innovative architecture: cross\-attention mechanisms, multiple specialized adapters, dual encoder design, and adaptive weighting\.
From Table[II](https://arxiv.org/html/2606.17476#S3.T2), on the Steel dataset Multi\-Adapter PPO attains testR2=0\.7821R^\{2\}=0\.7821and RMAE=0\.0286=0\.0286, and on the Coal dataset testR2=0\.0976R^\{2\}=0\.0976and RMAE=0\.2029=0\.2029, consistently ranking among the best\. In contrast, PSO yields negative or very lowR2R^\{2\}on both datasets \(−0\.0580\-0\.0580and−13\.7224\-13\.7224\), indicating poor generalization of the selected wavelengths to unseen samples\. The 45\.2% improvement in RMAE of PPO\-based methods over PSO reflects that the learned sequential policy selects more informative and stable wavelength subsets, which in turn leads to more accurate concentration prediction when coupled with a downstream regressor\.
## IVConclusion
This paper presents a comprehensive Multi\-Adapter PPO framework for wavelength selection in LIBS quantitative analysis\. We formulate wavelength selection as a sequential MDP problem and establish the theoretical advantages of PPO over traditional PSO through regret bound analysis\. The proposed Multi\-Adapter architecture leverages cross\-attention mechanisms and multiple specialized adapters to capture diverse spectral–target relationships, enabling adaptive feature selection that outperforms conventional optimization methods\.
## References
- \[1\]J\. Ding and L\. Fu\(2018\)A hybrid feature selection algorithm based on information gain and sequential forward floating search\.J Intell Comput9\(3\),pp\. 93\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[2\]J\. Feng, D\. Li, J\. Gu, X\. Cao, R\. Shang, X\. Zhang, and L\. Jiao\(2021\)Deep reinforcement learning for semisupervised hyperspectral band selection\.IEEE Transactions on Geoscience and Remote Sensing60,pp\. 1–19\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p3.1)\.
- \[3\]S\. Gilda and Z\. Slepian\(2019\)Automatic kalman\-filter\-based wavelet shrinkage denoising of 1d stellar spectra\.Monthly Notices of the Royal Astronomical Society490\(4\),pp\. 5249–5269\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[4\]E\. Harefa and W\. Zhou\(2021\)Performing sequential forward selection and variational autoencoder techniques in soil classification based on laser\-induced breakdown spectroscopy\.Analytical Methods13\(41\),pp\. 4926–4933\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[5\]T\. He, J\. Liang, H\. Tang, T\. Zhang, C\. Yan, and H\. Li\(2021\)Quantitative analysis of coal quality by mutual information\-particle swarm optimization \(mi\-pso\) hybrid variable selection method coupled with spectral fusion strategy of laser\-induced breakdown spectroscopy \(libs\) and fourier transform infrared spectroscopy \(ftir\)\.Spectrochimica Acta Part B: Atomic Spectroscopy178,pp\. 106112\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[6\]X\. He, Y\. Zhao, and F\. Li\(2023\)A new technique for baseline calibration of soil x\-ray fluorescence spectra based on enhanced generative adversarial networks combined with transfer learning\.Journal of Analytical Atomic Spectrometry38\(11\),pp\. 2486–2498\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[7\]X\. Jiang, F\. Li, Q\. Wang, J\. Luo, J\. Hao, and M\. Xu\(2021\)Baseline correction method based on improved adaptive iteratively reweighted penalized least squares for the x\-ray fluorescence spectrum\.Applied Optics60\(19\),pp\. 5707–5715\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[8\]Q\. Jiao, B\. Cai, M\. Liu, L\. Dong, M\. Hei, L\. Kong, and Y\. Zhao\(2024\)A three\-stage deep learning\-based training frame for spectra baseline correction\.Analytical Methods16\(10\),pp\. 1496–1507\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[9\]M\. Kazemzadeh, M\. Martinez\-Calderon, W\. Xu, L\. W\. Chamley, C\. L\. Hisey, and N\. G\. Broderick\(2022\)Cascaded deep convolutional neural networks as improved methods of preprocessing raman spectroscopy data\.Analytical Chemistry94\(37\),pp\. 12907–12918\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[10\]H\. Li, H\. Huang, and Z\. Qian\(2021\)Latency\-aware batch task offloading for vehicular cloud: maximizing submodular bandit\.In2021 IEEE 14th International Conference on Cloud Computing \(CLOUD\),pp\. 584–593\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[11\]H\. Li, X\. Liu, and Y\. Jin\(2026\)R3D: regional\-guided residual radar diffusion\.arXiv preprint arXiv:2601\.06465\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[12\]H\. Li and M\. F\. Zhuo\(2026\)Revisiting the scale loss function and gaussian\-shape convolution for infrared small target detection\.arXiv preprint arXiv:2604\.09991\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[13\]H\. Li\(2026\)Golden rpg: confidence\-adaptive region\-aware noise for compositional text\-to\-image generation\.arXiv preprint arXiv:2604\.25314\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[14\]P\. Liu, J\. Lin, S\. Wang, Y\. Xu, H\. Li, X\. Xie, S\. Wu, and H\. Li\(2025\)Learning to decide with just enough: information\-theoretic context summarization for cmdps\.arXiv preprint arXiv:2510\.01620\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[15\]W\. Ma, Z\. Yu, Z\. Lu, Q\. Ma, and S\. Yao\(2023\)A step\-by\-step classification method of coal and miscellaneous materials by laser\-induced breakdown spectroscopy\.At\. Spectrosc44\(3\),pp\. 160–168\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[16\]M\. J\. C\. Pontes, J\. Cortez, R\. K\. H\. Galvão, C\. Pasquini, M\. C\. U\. Araújo, R\. M\. Coelho, M\. K\. Chiba, M\. F\. de Abreu, and B\. E\. Madari\(2009\)Classification of brazilian soils by using libs and variable selection in the wavelet domain\.Analytica Chimica Acta642\(1\-2\),pp\. 12–18\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[17\]Y\. Qiao, Y\. Xiong, H\. Gao, X\. Zhu, and P\. Chen\(2018\)Protein\-protein interface hot spots prediction based on a hybrid feature selection strategy\.BMC bioinformatics19\(1\),pp\. 14\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[18\]A\. Saptoro, M\. O\. Tadé, and H\. Vuthaluru\(2012\)A modified kennard\-stone algorithm for optimal division of data for developing artificial neural network models\.Chemical Product and Process Modeling7\(1\)\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[19\]H\. Shang, Q\. Wu, J\. Wu, S\. Zhou, Z\. Wang, H\. Wang, and J\. Yin\(2024\)Study on breast cancerization and isolated diagnosis in situ by hof\-atr\-mir spectroscopy with deep learning\.Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy319,pp\. 124546\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[20\]J\. Shen, M\. Li, Z\. Li, Z\. Zhang, and X\. Zhang\(2022\)Single convolutional neural network model for multiple preprocessing of raman spectra\.Vibrational Spectroscopy121,pp\. 103391\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[21\]J\. R\. Vergara and P\. A\. Estévez\(2014\)A review of feature selection methods based on mutual information\.Neural computing and applications24\(1\),pp\. 175–186\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[22\]J\. Vrábel, P\. Pořízka, and J\. Kaiser\(2020\)Restricted boltzmann machine method for dimensionality reduction of large spectroscopic data\.Spectrochimica Acta Part B: Atomic Spectroscopy167,pp\. 105849\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[23\]L\. Wang, G\. Tolok, Y\. Fu, L\. Xu, L\. Li, H\. Gao, and Y\. Zhou\(2024\)Application and research progress of laser\-induced breakdown spectroscopy in agricultural product inspection\.ACS omega9\(23\),pp\. 24203–24218\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1),[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[24\]R\. Wei, C\. Garcia, A\. El\-Sayed, V\. Peterson, and A\. Mahmood\(2020\)Variations in variational autoencoders\-a comparative evaluation\.Ieee Access8,pp\. 153651–153670\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[25\]D\. Xu, Y\. Li, M\. Yin, X\. Li, H\. Li, and Z\. Qian\(2017\)A reliable resource scheduling for network function virtualization\.InInternational Conference on Security, Privacy and Anonymity in Computation, Communication and Storage,pp\. 251–260\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[26\]H\. Xu, Z\. Liu, W\. Cai, and X\. Shao\(2009\)A wavelength selection method based on randomization test for near\-infrared spectral analysis\.Chemometrics and Intelligent Laboratory Systems97\(2\),pp\. 189–193\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[27\]C\. Yan, J\. Liang, M\. Zhao, X\. Zhang, T\. Zhang, and H\. Li\(2019\)A novel hybrid feature selection strategy in quantitative analysis of laser\-induced breakdown spectroscopy\.Analytica chimica acta1080,pp\. 35–42\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.
- \[28\]C\. Yan\(2025\)A review on spectral data preprocessing techniques for machine learning and quantitative analysis\.iScience\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[29\]M\. Yuan, Q\. Zeng, J\. Wang, W\. Li, G\. Chen, Z\. Li, Y\. Liu, L\. Guo, X\. Li, and H\. Yu\(2021\)Rapid classification of steel via a modified support vector machine algorithm based on portable fiber\-optic laser\-induced breakdown spectroscopy\.Optical Engineering60\(12\),pp\. 124114–124114\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p1.1)\.
- \[30\]Y\. Yun, H\. Li, B\. Deng, and D\. Cao\(2019\)An overview of variable selection methods in multivariate analysis of near\-infrared spectra\.TrAC Trends in Analytical Chemistry113,pp\. 102–115\.Cited by:[§I](https://arxiv.org/html/2606.17476#S1.p2.1)\.Similar Articles
Token-weighted Direct Preference Optimization with Attention
Proposes AttentionPO, a token-weighted direct preference optimization method that uses attention from the LLM itself to estimate token weights, improving alignment performance on AlpacaEval, MT-Bench, and ArenaHard without requiring a separate reward model.
Spectral Souping: A Unified Framework for Online Preference Alignment
This paper introduces Spectral Souping, a framework for efficiently aligning LLMs with individual user preferences by discovering a universal spectral representation that enables merging of specialized policies at inference time without costly retraining.
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
APPO: Agentic Procedural Policy Optimization
APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.