Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Signal

arXiv cs.LG Papers

Summary

Introduces Peak-Detector, a framework that uses instruction-tuned large language models for robust, cross-modal, and explainable peak detection in physiological signals like ECG, PPG, BCG, and BSG. The method transforms time-series data into a condensed 'peak-representation' format and is optimized via supervised fine-tuning followed by reinforcement learning with a multi-objective reward.

arXiv:2605.16452v1 Announce Type: new Abstract: Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities-ECG, PPG, BCG, and BSG-spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

Cached at: 05/19/26, 06:43 AM

# Peak-Detector: Explainable Peak Detection via Instruction–Tuned Large Language Models in Physiological Signal
Source: [https://arxiv.org/html/2605.16452](https://arxiv.org/html/2605.16452)
Authors:

- Yida Zhang ([0009-0000-5555-3778](https://orcid.org/0009-0000-5555-3778)), University of Georgia, Athens, Georgia, USA, yida.zhang@uga.edu
- Zixuan Zeng ([0009-0005-7051-7421](https://orcid.org/0009-0005-7051-7421)), University of Georgia, Athens, Georgia, USA, zxzeng@uga.edu
- Jiayu Chen ([0009-0003-0904-4114](https://orcid.org/0009-0003-0904-4114)), University of Georgia, Athens, Georgia, USA, jiayu.chen@uga.edu
- Yingjian Song ([0009-0005-5601-4465](https://orcid.org/0009-0005-5601-4465)), University of Georgia, Athens, Georgia, USA, ys63016@uga.edu
- Yin Xiao ([0009-0005-2247-8042](https://orcid.org/0009-0005-2247-8042)), Yixing People’s Hospital, Yixing, Jiangsu Province, China, yinxiaosn@163.com
- Nishan Dong ([0009-0006-4463-9246](https://orcid.org/0009-0006-4463-9246)), Yixing People’s Hospital, Yixing, Jiangsu Province, China, staff1820@yxph.com
- Junjie Lu ([0000-0001-7547-6378](https://orcid.org/0000-0001-7547-6378)), Yixing People’s Hospital, Yixing, Jiangsu Province, China, staff806@yxph.com
- Younghoon Kwon ([0000-0002-8152-9170](https://orcid.org/0000-0002-8152-9170)), University of Washington, Seattle, Washington, USA, yhkwon@uw.edu
- Xiang Zhang ([0000-0001-5097-2113](https://orcid.org/0000-0001-5097-2113)), University of North Carolina at Charlotte, Charlotte, North Carolina, USA, xiang.zhang@charlotte.edu
- Jin Lu ([0000-0003-1356-0202](https://orcid.org/0000-0003-1356-0202)), University of Georgia, Athens, Georgia, USA, jin.lu@uga.edu
- Wenzhan Song ([0000-0001-8174-1772](https://orcid.org/0000-0001-8174-1772)), University of Georgia, Athens, Georgia, USA, wsong@uga.edu
- Fei Dou ([0000-0003-4246-8616](https://orcid.org/0000-0003-4246-8616)), University of Georgia, Athens, Georgia, USA, fei.dou@uga.edu

###### Abstract\.

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram \(ECG\), Photoplethysmogram \(PPG\), Ballistocardiogram \(BCG\), and Bodyseismography \(BSG\), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability\. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability\. Conversely, deep learning\-based methods often lack interpretability, limiting transparency for expert verification and hindering expert–computer interaction\. To address these limitations, we introduce Peak\-Detector, a novel framework that leverages instruction\-tuned Large Language Models \(LLMs\) for robust, cross\-modal, and explainable peak detection\. A core innovation of our framework is a ”peak\-representation” technique that transforms time\-series data into a condensed format, preserving critical event information while significantly reducing signal length\. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data\. The model is optimized through a two\-stage process: supervised fine\-tuning \(SFT\) followed by reinforcement learning \(RL\) with a multi\-objective reward function\. The model’s self\-explanation capabilities are cultivated by fine\-tuning on a custom\-built Peak\-Explanation dataset\. Across four modalities—ECG, PPG, BCG, and BSG—spanning seven datasets \(six public benchmarks plus one real\-world cohort\), Peak\-Detector demonstrates strong cross\-modal performance, achieving best or tied\-best detection under clinically relevant temporal tolerance\. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis\. Together, these results indicate a transparent and generalizable framework for trustworthy peak analysis and cardiovascular metric extraction\.

Keywords: Peak Detection, Large Language Models, Physiological Signal Processing, Cardiovascular Monitoring, Explainable AI, Instruction Tuning, Multimodal Signal Analysis

CCS Concepts: Human-centered computing; Human computer interaction (HCI)

## 1\.INTRODUCTION

Continuous cardiovascular monitoring is an important component of mobile and ubiquitous health, supporting the longitudinal assessment of cardiac function across wearable, contactless, and clinical sensing settings\. Unobtrusive sensing modalities, particularly the Electrocardiogram \(ECG\), Photoplethysmogram \(PPG\), Ballistocardiogram \(BCG\), and Bodyseismography \(BSG\), have become cornerstone technologies for the longitudinal assessment of an individual’s health status\(Gordon,[1877](https://arxiv.org/html/2605.16452#bib.bib1); Kimet al\.,[2016](https://arxiv.org/html/2605.16452#bib.bib2); Songet al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib3); Penzelet al\.,[2016](https://arxiv.org/html/2605.16452#bib.bib4); Temko,[2017](https://arxiv.org/html/2605.16452#bib.bib5)\)\. The ability of these sensors to facilitate timely identification of cardiac arrhythmias—such as Premature Atrial Contractions \(PACs\), Premature Ventricular Contractions \(PVCs\), and Atrial Fibrillation \(AFib\)—is critical for preventing adverse health outcomes\. However, the efficacy of these monitoring systems hinges on the accurate and robust detection of key prominent points, or ”peaks,” within the collected physiological data\.

The primary challenge in this domain lies in the inherent heterogeneity of these signal modalities\. Each captures a different facet of the cardiac cycle through distinct physical principles, resulting in unique signal morphologies \(Fig[1](https://arxiv.org/html/2605.16452#S2.F1)\)\. The ECG records the heart’s electrical activity, with the prominent R\-peak signifying peak ventricular depolarization\(Setiawidayat and Rahman,[2018](https://arxiv.org/html/2605.16452#bib.bib6)\)\. The PPG waveform features a systolic peak, which represents the point of maximum blood volume in the peripheral tissue during ventricular contraction\(Castanedaet al\.,[2018](https://arxiv.org/html/2605.16452#bib.bib7)\)\. The BCG provides a contactless method by measuring the body’s subtle vibrations from cardiac ejections, characterized by a prominent J\-peak\(Liet al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib8); Suet al\.,[2009](https://arxiv.org/html/2605.16452#bib.bib9); Azhaginiyanet al\.,[2019](https://arxiv.org/html/2605.16452#bib.bib10)\)\. The BSG delivers richer information content by effectively containing features of both BCG \(whole\-body movement\) and SCG \(direct precordial vibrations\)\(Songet al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib3); Pitafiet al\.,[2025](https://arxiv.org/html/2605.16452#bib.bib67); Songet al\.,[2025](https://arxiv.org/html/2605.16452#bib.bib68)\)\. This diversity in signal origin and quality necessitates a versatile and resilient approach to peak detection\(Warmerdamet al\.,[2018](https://arxiv.org/html/2605.16452#bib.bib14); Funget al\.,[2004](https://arxiv.org/html/2605.16452#bib.bib15)\)\.

Existing peak\-detection methods remain limited in this regard\. Traditional signal\-processing approaches rely on modality\-specific domain knowledge, employing handcrafted features and heuristics that are often brittle and require meticulous parameter tuning\(Chakrabortyet al\.,[2020](https://arxiv.org/html/2605.16452#bib.bib11); Elgendiet al\.,[2013](https://arxiv.org/html/2605.16452#bib.bib12); Kuntamalla and Reddy,[2014](https://arxiv.org/html/2605.16452#bib.bib13)\)\. Consequently, these methods lack generalizability across different signal types\. While deep learning models offer greater adaptability by learning features directly from data, they are often criticized as ”black boxes\.”\(Kazemiet al\.,[2022](https://arxiv.org/html/2605.16452#bib.bib16); Sarkar and Etemad,[2021](https://arxiv.org/html/2605.16452#bib.bib17); Xu and Shuttleworth,[2024](https://arxiv.org/html/2605.16452#bib.bib18); Castelvecchi,[2016](https://arxiv.org/html/2605.16452#bib.bib19)\)\. This lack of transparency and interpretability creates a significant barrier to trust and adoption in critical clinical applications, where understanding the model’s reasoning is as important as its output\.

To address the limitations mentioned, we introduce Peak\-Detector, a framework that reformulates cardiac peak detection as a language\-guided reasoning task using LLMs\. By leveraging the advanced reasoning and linguistic capabilities of LLMs\(Biet al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib20); Wanget al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib21)\), our approach utilizes a novel Peak Representation that converts sparse, high\-frequency physiological signals into condensed symbolic sequences\. This shift effectively addresses the performance degradation typically associated with processing long, continuous numerical sequences\(Fonset al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib25)\)\. To bolster interpretability, we developed a Peak\-Explanation Dataset via a specialized data generation pipeline designed to enhance the model’s self\-explanatory capabilities\. The model is trained using a two\-stage instruction\-tuning strategy\. The first stage, Supervised Fine\-Tuning \(SFT\), establishes a reliable and correctly formatted output structure\. The second stage employs Reinforcement Learning \(RL\) with Group Relative Policy Optimization \(GRPO\), optimizing the model against a multi\-objective reward function\. This function is designed to concurrently enhance format validity, heart\-rate consistency, positional accuracy, and detection completeness, ensuring a robust and physiologically consistent output\.

We position Peak-Detector not as a strict real-time, always-on edge model for battery-constrained wearables, but as an explainable and higher-fidelity analysis component within broader cardiovascular sensing workflows. Under this framing, lightweight front-end methods may be used for continuous screening, while Peak-Detector can be invoked selectively for flagged windows, uncertain segments, degraded signals, or retrospective summaries where interpretability and auditability are especially important. This perspective is particularly relevant in ubiquitous sensing pipelines, where deployment decisions must balance signal quality, compute budget, and the need for trustworthy downstream metric extraction.

In summary, this work makes the following contributions:

- We introduce a modality-agnostic Peak Representation that summarizes physiological signals as timestamped local extrema with their signal values, enabling token-efficient, auditable reasoning. It compresses raw ECG, PPG, and BCG data by 87%, 97%, and 89%, respectively, while maintaining high fidelity for signal reconstruction, with correlation coefficients of 0.94, 0.94, and 0.97.
- We develop Peak-Detector, a two-stage SFT→RL pipeline whose multi-objective reward jointly optimizes detection accuracy, temporal consistency, and concise, structured rationales. It is supported by a scalable data-generation pipeline that builds a self-explainable peak-detection dataset.
- Across ECG/PPG/BCG/BSG on six public datasets and one real-world cohort, Peak-Detector achieves best-or-tied-best detection under clinically relevant tolerance with competitive HR/HRV errors, while providing step-by-step rationales.

## 2\.RELATED WORK

### 2\.1\.Peak Detection in Cardiac Physiological Signals

The detection of fiducial points in cardiac physiological signals has primarily developed along two methodological lines: traditional signal\-processing approaches and data\-driven learning approaches\.

Signal\-Processing Approaches\.Classical methods rely on modality\-specific domain knowledge and handcrafted heuristics\. For ECG R\-peak detection, the Pan–Tompkins algorithm remains a foundational approach, using bandpass filtering, differentiation, squaring, and moving\-window integration to isolate the QRS complex\(Farihaet al\.,[2020](https://arxiv.org/html/2605.16452#bib.bib26); Sathyapriyaet al\.,[2014](https://arxiv.org/html/2605.16452#bib.bib28); Wuet al\.,[2020](https://arxiv.org/html/2605.16452#bib.bib27)\)\. Wavelet\-based methods provide a multi\-scale alternative for delineating salient peaks\(Banerjeeet al\.,[2012](https://arxiv.org/html/2605.16452#bib.bib29)\)\. In PPG analysis, systolic peaks are commonly detected through local\-maxima search or first\-derivative analysis with adaptive thresholds\(Lerddararadsamee and Jiraraksopakun,[2012](https://arxiv.org/html/2605.16452#bib.bib32); Kelleyet al\.,[2008](https://arxiv.org/html/2605.16452#bib.bib33); Liet al\.,[2010](https://arxiv.org/html/2605.16452#bib.bib35); Brüseret al\.,[2013](https://arxiv.org/html/2605.16452#bib.bib36); Shinet al\.,[2008](https://arxiv.org/html/2605.16452#bib.bib37)\)\. BCG J\-peak detection often requires more noise\-tolerant strategies, such as template matching, autocorrelation, or motion\-artifact reconstruction\(Alivaret al\.,[2019](https://arxiv.org/html/2605.16452#bib.bib38)\)\. Similarly, BSG analysis frequently estimates cardiac periodicity by identifying dominant peaks in the autocorrelation function \(ACF\)\(Songet al\.,[2023](https://arxiv.org/html/2605.16452#bib.bib69)\)\. Although effective in controlled settings, these approaches are often brittle under changing signal morphology, sensor characteristics, or degraded signal quality, and they typically require substantial modality\-specific tuning\.

Deep Learning Approaches\.To address the limitations of hand\-engineered heuristics, deep learning architectures—particularly Convolutional Neural Networks \(CNNs\)—have been applied to peak detection in ECG, PPG, and BCG signals\(Schranzet al\.,[2024](https://arxiv.org/html/2605.16452#bib.bib39); Kazemiet al\.,[2022](https://arxiv.org/html/2605.16452#bib.bib16); Chenet al\.,[2023b](https://arxiv.org/html/2605.16452#bib.bib40)\)\. These methods, such as RPNet, cast peak detection as a supervised prediction problem and learn salient features directly from data, often improving robustness and accuracy relative to traditional pipelines\(Vijayaranganet al\.,[2020](https://arxiv.org/html/2605.16452#bib.bib41)\)\. However, their predictions are typically produced without explicit human\-readable justification\. As a result, although these models can be effective detectors, they often provide limited support for expert verification, audit, or failure analysis when signals are noisy or ambiguous\. This limitation motivates the exploration of explainable LLM\-based approaches that aim to combine competitive detection performance with structured, interpretable rationales for peak selection\.

### 2\.2\.Large Language Models in Physiological Time\-Series Analysis

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/introduction/peaks_with_label.png)

Figure 1. Physiological signals with labelled peaks: annotated examples with marked fiducial peaks.

Large Language Models (LLMs) are increasingly applied to physiological time-series analysis, though research currently focuses on high-level semantic tasks like sleep stage captioning in OpenTSLM (Langer et al., [2025](https://arxiv.org/html/2605.16452#bib.bib22)) or arrhythmia diagnosis in ECG-LM (Yang et al., [2025](https://arxiv.org/html/2605.16452#bib.bib45)). Despite their power, these models often struggle with "numeric grounding", that is, the precise localization of events such as a specific peak at a distinct time index, because they treat signals holistically rather than as sequences requiring fine-grained numeric inference.

This gap is largely due to the architectural limitations of auto-regressive models: because they generate tokens sequentially, cumulative errors can cause the model to "drift out of distribution" during long-form tasks (Arbuzov et al., [2025](https://arxiv.org/html/2605.16452#bib.bib46)). Consequently, numeric retrieval accuracy tends to degrade significantly as input length increases (Fons et al., [2024](https://arxiv.org/html/2605.16452#bib.bib25)). To overcome this fundamental constraint, our approach distills physiological signals into a condensed Peak Representation, which shortens the input sequence for the LLM while preserving its essential informational content for precise temporal localization.

### 2\.3\.Explainable AI for Physiological Signal Analysis

Explainable AI \(XAI\) is increasingly vital in medicine due to ethical requirements for patient transparency and the practical need for expert confidence in automated decision\-making\(Tjoa and Guan,[2020](https://arxiv.org/html/2605.16452#bib.bib47)\)\. Traditional XAI research primarily focuses on post\-hoc methods that seek to explain a model’s decision by attributing importance to specific input features\(Baehrenset al\.,[2010](https://arxiv.org/html/2605.16452#bib.bib42); Zeiler and Fergus,[2014](https://arxiv.org/html/2605.16452#bib.bib43)\)\. Techniques like Gradient SHAP \(GS\), for instance, explain predictions by computing the gradients of outputs with respect to points along an interpolation path from a baseline reference to the input\(Lundberg and Lee,[2017](https://arxiv.org/html/2605.16452#bib.bib44)\)\. While these techniques identify influential data points, they often yield complex interpretations that fail to capture a model’s underlying logic or provide a structured diagnostic rationale\. To address these limitations, the Peak\-Detector framework shifts from post\-hoc attribution to inherent, self\-explanatory design\. By utilizing a custom Peak\-Explanation Dataset and instruction\-tuning, the model generates direct, intuitive justifications for its detections, positioning the framework as a transparent partner in expert review workflows where reasoning is as critical as numerical accuracy\.

## 3\.METHOD

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/peak_representation/peak_detector.png)

Figure 2. Peak-Detector framework.

### 3\.1\.Peak Detector Framework

This section details the design and implementation of Peak\-Detector, our proposed framework for accurate and explainable cross\-modal peak detection\. As illustrated in Fig\.[2](https://arxiv.org/html/2605.16452#S3.F2), the methodology is presented in a structured, block\-by\-block progression:

- Block 1: Data Construction and Representation. We commence by introducing the foundational Peak Representation (Section [3.2](https://arxiv.org/html/2605.16452#S3.SS2)), a technique that discretizes dense physiological time-series into a compact, text-based sequence optimized for LLM processing. Building on this, we delineate the construction of the Peak-Explanation Dataset (Section [3.3](https://arxiv.org/html/2605.16452#S3.SS3)), a novel instruction-tuning corpus synthesized via a semi-automated pipeline that pairs signal representations with human-aligned, LLM-generated explanations.
- Block 2: Supervised Fine-Tuning (Stage 1). As detailed in Section [3.4.1](https://arxiv.org/html/2605.16452#S3.SS4.SSS1), the first phase of our Two-Stage Instruction Tuning Strategy focuses on establishing a baseline capability. In this block, a base LLM undergoes supervised fine-tuning on the Peak-Explanation Dataset to internalize the specific signal syntax and explanatory format. This process yields the SFT Model, which functions as the reference policy for the subsequent optimization stage.
- Block 3: Reinforcement Learning (Stage 2). Following the SFT phase, the framework transitions to the optimization block (Section [3.4.2](https://arxiv.org/html/2605.16452#S3.SS4.SSS2)). Here, the SFT Model serves as the initialization for Group Relative Policy Optimization (GRPO). We optimize the model against a multi-objective reward function, encompassing detection accuracy ($F_1$) and physiological plausibility, to produce the fully refined Peak-Detector.
- Block 4: Inference and Evaluation. The pipeline concludes with the deployment of the optimized Peak-Detector on unseen data. In this stage, the model processes raw signal representations to generate precise peak coordinates and diagnostic explanations, evaluated with standard peak-detection and cardiovascular error metrics.
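The detection accuracy ($F_1$) mentioned in Blocks 3 and 4 is usually computed by matching predicted peaks to ground-truth peaks within a temporal tolerance. The sketch below is one minimal way to do that, not the authors' evaluation code: the function names, the greedy two-pointer matching, and the 50 ms default tolerance are all assumptions made for illustration.

```python
def match_peaks(pred, truth, tol):
    """Greedy one-to-one matching of peak times (seconds) within +/- tol."""
    pred, truth = sorted(pred), sorted(truth)
    i = j = tp = 0
    while i < len(pred) and j < len(truth):
        if abs(pred[i] - truth[j]) <= tol:
            tp += 1          # matched pair -> true positive
            i += 1
            j += 1
        elif pred[i] < truth[j]:
            i += 1           # prediction with no partner -> false positive
        else:
            j += 1           # ground-truth peak missed -> false negative
    return tp, len(pred) - tp, len(truth) - tp


def f1_score(pred, truth, tol=0.05):
    """F1 under a temporal tolerance; tol=0.05 s is a hypothetical default."""
    tp, fp, fn = match_peaks(pred, truth, tol)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Three correct detections plus one spurious peak at 2.60 s.
demo_f1 = f1_score([0.50, 1.32, 2.10, 2.60], [0.52, 1.30, 2.11])
```

Here precision is 3/4 and recall is 3/3, so `demo_f1` is about 0.857.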

### 3\.2\.Peak Representation: From Signal to Sequence

Our approach assumes that the most informative content in a physiological time-series is primarily encoded around local extrema that delineate each cardiac cycle. Intermediate samples can be reconstructed via interpolation between these key points, as illustrated in Fig. [3](https://arxiv.org/html/2605.16452#S3.F3). This observation motivates transforming a dense numeric signal into a compact, structured sequence that an LLM can reason over. The proposed Peak Representation operationalizes this idea and serves as the input to Peak-Detector.
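The reconstruction claim can be checked with a small experiment: keep only a few key points, interpolate linearly between them, and compare the result with the original using a Pearson correlation. The sketch below does this on a synthetic triangle wave. The helper names, the toy signal, and the "fraction of samples dropped" measure are illustrative assumptions; how the paper measures compression is not specified in this section.

```python
import math


def interpolate(points, n):
    """Linearly interpolate (index, value) key points onto indices 0..n-1."""
    points = sorted(points)
    first_t, first_v = points[0]
    last_t, last_v = points[-1]
    out, k = [], 0
    for t in range(n):
        if t <= first_t:                 # clamp before the first key point
            out.append(float(first_v))
        elif t >= last_t:                # clamp after the last key point
            out.append(float(last_v))
        else:
            while points[k + 1][0] < t:  # advance to the bracketing pair
                k += 1
            (t0, v0), (t1, v1) = points[k], points[k + 1]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy triangle wave: its turning points (plus endpoints) describe it exactly.
signal = [abs((t % 20) - 10) for t in range(41)]
key_points = [(0, 10), (10, 0), (20, 10), (30, 0), (40, 10)]
recon = interpolate(key_points, len(signal))
r = pearson(signal, recon)
dropped = 1 - len(key_points) / len(signal)  # fraction of samples not stored
```

For a piecewise-linear toy signal the reconstruction is exact. Real physiological waveforms are curved between extrema, which is why reported correlations fall below 1.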

Step 1: Inclusive Local-Extrema Extraction. We perform a lenient local-extrema search (Virtanen et al., [2020](https://arxiv.org/html/2605.16452#bib.bib48)) on each preprocessed segment $x=\{x_t\}_{t=1}^{T}$ to construct a comprehensive candidate set of peaks $\mathcal{C}=\{c_i\}$ by collecting a high-recall set of both maxima and minima. Each candidate $c_i$ stores its temporal index $t_i$ and normalized amplitude $a_i$. This “inclusive-first” strategy ensures that all potential fiducial points are preserved for downstream reasoning.
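A minimal sketch of such an inclusive search is given below in plain Python. It keeps every strict interior maximum and minimum with no height, distance, or prominence filtering. The min-max normalization and the `(index, amplitude, kind)` tuple layout are assumptions. The cited SciPy library offers `scipy.signal.find_peaks` for the same purpose, but the authors' exact settings are not given here.

```python
def local_extrema(x):
    """Return (index, normalized_amplitude, kind) for each strict local extremum.

    No thresholds are applied, so recall is favoured over precision.
    Flat plateaus are ignored in this simplified version.
    """
    lo, hi = min(x), max(x)
    span = (hi - lo) or 1.0                  # avoid division by zero on flat input
    candidates = []
    for t in range(1, len(x) - 1):
        if x[t] > x[t - 1] and x[t] > x[t + 1]:
            kind = "max"
        elif x[t] < x[t - 1] and x[t] < x[t + 1]:
            kind = "min"
        else:
            continue
        candidates.append((t, (x[t] - lo) / span, kind))  # min-max normalized
    return candidates


demo_candidates = local_extrema([0, 1, 0, -1, 0, 3, 0])
```

For this toy input, the candidates are a maximum at index 1, a minimum at index 3, and a maximum at index 5.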

Step 2: Timestamp Transformation. A key point of our representation is encoding each peak’s temporal position as a human-readable timestamp rather than a raw numeric index. Motivated by evidence that LLMs excel at calendrical and timestamp reasoning compared to long numeric sequences (Fons et al., [2024](https://arxiv.org/html/2605.16452#bib.bib25)), each sample index $t_i$ is converted into a high-resolution synthetic calendar timestamp:

$$T_i(t_i) = \text{YYYY-MM-DD}~\text{HH:MM:SS}$$

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/peak_representation/approximation.png)

Figure 3. Signal approximation by interpolating peaks: the original physiological signal is approximated by interpolation between selected peaks.

A fixed reference time $T_0 = \texttt{2020-01-01 00:00:00}$ is used, and absolute indices are computed across segments to prevent window-local ambiguities. The index $t_i$ is transformed into elapsed seconds by dividing by the sampling frequency $F_s$ (i.e., $\text{calendar-second} = t_i / F_s$), then formatted into the “HH:MM:SS” string. For example, at $F_s = 1$ Hz, a peak at $t_i = 97$ corresponds to 97 s → “00:01:37”, which is appended to the reference date to yield the final timestamp. Timestamps are reset at the beginning of each segment to maintain bounded token lengths while allowing the model to reason about relative temporal intervals.
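The index-to-timestamp mapping can be sketched with the standard `datetime` module, as below. The function name is hypothetical. This version truncates to whole seconds because the "HH:MM:SS" format has no sub-second field, and how the authors treat fractional seconds is not stated in this section.

```python
from datetime import datetime, timedelta

# Fixed reference time T0 from the text.
T0 = datetime(2020, 1, 1, 0, 0, 0)


def index_to_timestamp(t_i, fs):
    """Map sample index t_i at sampling rate fs (Hz) to a synthetic timestamp."""
    elapsed = timedelta(seconds=t_i / fs)   # calendar-second = t_i / F_s
    # strftime with %S drops any fractional part (an assumption of this sketch).
    return (T0 + elapsed).strftime("%Y-%m-%d %H:%M:%S")


demo_ts = index_to_timestamp(97, 1)
```

With the worked example from the text (`t_i = 97`, `F_s = 1` Hz), `demo_ts` is `2020-01-01 00:01:37`.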

Step 3: Serialization into a Compact Sequence\.Each detected peak is represented as a key–value pair and serialized into a structured textual sequence\. All entries are arranged chronologically to preserve temporal coherence, forming a compact and information\-dense representation well suited for LLM processing\.
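As a rough illustration of this serialization step, the sketch below sorts candidates chronologically and writes them as (Date, Value) pairs between the `<TS_START>` and `<TS_END>` markers that Section 3.3 describes. The exact delimiter layout, the two-decimal rounding, and the function name are assumptions rather than the authors' format.

```python
def serialize(peaks):
    """Serialize (timestamp_string, value) pairs into one compact text line.

    Fixed-width 'YYYY-MM-DD HH:MM:SS' strings sort chronologically when
    sorted as plain text, so no datetime parsing is needed here.
    """
    ordered = sorted(peaks, key=lambda p: p[0])
    body = ", ".join(f"({ts}, {value:.2f})" for ts, value in ordered)
    return f"<TS_START> {body} <TS_END>"


demo_text = serialize([
    ("2020-01-01 00:00:02", 0.5),
    ("2020-01-01 00:00:01", 1.0),
])
```

The out-of-order input is emitted in time order, which is what keeps the sequence temporally coherent for the model.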

Overall, this design aims to: \(1\) compress long physiological waveforms into short, structure\-preserving token sequences; \(2\) retain the fiducial information necessary for accurate peak detection across ECG, PPG, BCG, and BSG; \(3\) express inputs in a format that LLMs handle effectively \(calendar/timestamp\-like symbols rather than long raw numerics\); and \(4\) remain modality\-agnostic and reproducible\.

### 3\.3\.Peak\-Explanation Dataset Construction

To jointly optimize accurate peak detection and interpretable reasoning, we construct the Peak-Explanation Dataset as (Instruction, Input, Output) triplets (Fig. [4](https://arxiv.org/html/2605.16452#S3.F4)). Instruction specifies (i) the target peak type for the given modality (e.g., J-peaks in BCG/BSG, R-peaks in ECG, systolic peaks in PPG), (ii) a concise guideline describing the morphological/temporal characteristics of the target peaks, and (iii) the required output format (e.g., `{J:[t1,t2,...] Explanation: ...}`). Input is a compact textual serialization of the signal segment produced by our Peak Representation, which lists candidate extrema as (Date, Value) pairs between `<TS_START>` and `<TS_END>`. Output is a composite target: the ground-truth peak timestamps in the same representation, concatenated with an Explanation describing why these peaks are selected and why common distractors are rejected.
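To make the triplet structure concrete, the sketch below assembles one record as a Python dictionary. The field names, the instruction wording, and the sample values are hypothetical. Only the three-part structure and the `{J:[...] Explanation: ...}` output pattern come from the description above.

```python
def build_example(peak_label, guideline, serialized_input, peak_times, explanation):
    """Assemble one (Instruction, Input, Output) record for instruction tuning."""
    instruction = (
        f"Identify the {peak_label}-peaks in the signal below. {guideline} "
        f"Answer as {{{peak_label}:[t1,t2,...] Explanation: ...}}."
    )
    # Composite target: ground-truth timestamps followed by the rationale.
    output = f"{{{peak_label}:[{', '.join(peak_times)}] Explanation: {explanation}}}"
    return {"instruction": instruction, "input": serialized_input, "output": output}


demo_example = build_example(
    peak_label="J",
    guideline="J-peaks are the dominant upward deflection of each beat.",
    serialized_input="<TS_START> (2020-01-01 00:00:01, 1.00) <TS_END>",
    peak_times=["2020-01-01 00:00:01"],
    explanation="Largest upward deflection in the segment.",
)
```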

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/dataset_analysis/conversion.png)

Figure 4. A representative sample from the Peak-Explanation Dataset (BCG Arrhythmia subset). The figure illustrates the instruction-tuning triplet structure, comprising the task instruction, the input peak representation, and the ground-truth output with explanatory reasoning.

##### Explanation generation\.

We build this dataset using a semi-automated pipeline. First, raw signals and ground-truth peaks are preprocessed and converted into the textual Input sequence as described in Section [3.2](https://arxiv.org/html/2605.16452#S3.SS2). To generate the corresponding explanation in Output, we leverage a powerful "teacher" LLM (e.g., GPT-4o). As illustrated in Figure [5](https://arxiv.org/html/2605.16452#S3.F5), the prompt explicitly specifies the *signal modality/dataset* (e.g., ECG/PPG/BCG) and the corresponding *target-peak definition* (e.g., ECG R-peaks, PPG systolic peaks, BCG J-peaks), together with (1) the candidate peak list produced by our Peak Representation and (2) the ground-truth peaks (converted into the same timestamp/value format for consistency). Conditioned on these modality-specific instructions, the teacher LLM outputs a *structured rationale* that justifies why the provided ground-truth target peaks correspond to physiologically meaningful peaks among the candidate extrema, and explains why other candidates are rejected via: (1) morphology/amplitude evidence consistent with the modality, (2) temporal/IBI consistency, (3) amplitude threshold-based rejection of spurious candidates, and (4) surrounding-wave context (e.g., P/QRS/T for ECG, systolic upstroke and dicrotic notch for PPG, and I/J/K complexes for BCG).

![Refer to caption](https://arxiv.org/html/2605.16452v1/x1.png)

Figure 5. Teacher LLM-supervised data construction and structured response template.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/Two-strategy_training/SFT.png)

(a) SFT illustration

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/Two-strategy_training/RL.png)

(b) GRPO illustration

Figure 6. Comparison of training strategies: supervised fine-tuning and GRPO reinforcement learning workflows.

### 3\.4\.Two\-Stage Instruction Tuning Strategy

To effectively train Peak-Detector for both robust performance and explainability across diverse physiological signals, we employ a two-stage instruction tuning strategy. This approach combines Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) optimization, leveraging the strengths of each paradigm. We utilize a pre-trained open-source LLM (specifically Qwen2.5 for our experiments) as the base model, fine-tuning it with our Peak-Explanation Dataset.

#### 3\.4\.1\.Stage 1: Supervised Fine\-Tuning \(SFT\)

This stage focuses on adapting the general\-purpose base model to the specific syntax of physiological signals\. As illustrated in the SFT schematic \(Fig\.[6\(a\)](https://arxiv.org/html/2605.16452#S3.F6.sf1)\), the process operates through the following pipeline:

Input Processing and Embedding. The workflow begins with the Raw Text inputs from the Peak-Explanation Dataset, comprising the Instruction, Signal Input, and Ground Truth Output. These texts are tokenized and mapped to dense vectors via Look-up Tables that access the model’s Vocabulary. Crucially, Positional Embeddings are added to these token embeddings to preserve the temporal order of the signal sequence, ensuring the model retains time-series awareness.

Decoder-Only Architecture. The combined embeddings are processed through a stack of $N$ Decoder-only layers. Within each layer, Multi-Head Masked Self-Attention allows the model to attend to relevant historical context (signal antecedents) while preventing information leakage from future tokens. This is followed by Feed Forward Layers that process the attended features to capture non-linear morphological patterns.

Next-Token Prediction. The high-dimensional hidden states from the final transformer layer are projected into the vocabulary space via a Linear Layer. A Softmax function is then applied to generate a probability distribution over the vocabulary, determining the likelihood of the Predicted Token.

Optimization Objective\.The training objective is to minimize the discrepancy between the predicted distribution and theActual Token\. We employ a standard cross\-entropy loss function:

$$L=-\sum_{k=1}^{K}y_{k}\log(p_{k})\tag{1}$$

where $K$ is the vocabulary size, $y_{k}$ is the binary indicator for the ground\-truth token, and $p_{k}$ is the predicted probability\. The calculated loss drives the Update step, adjusting the model parameters via backpropagation to maximize the likelihood of correct peak detections and valid explanations\.
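As a worked illustration of the cross\-entropy objective, the sketch below computes the loss for a single next\-token prediction\. The four\-token vocabulary and the logit values are hypothetical; because the ground\-truth indicator is one\-hot, the sum reduces to the negative log\-probability of the correct token\.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Cross-entropy with a one-hot target: only the ground-truth term survives."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary of K = 4 tokens; index 2 is the ground-truth next token.
logits = [1.0, 0.5, 3.0, -1.0]
loss = cross_entropy(logits, target_index=2)
```

The loss shrinks as the model assigns more probability to the ground\-truth token and equals $\log K$ when the prediction is uniform\.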

This SFT stage serves three primary objectives: \(1\) it stabilizes the LLM’s learning dynamics for numerical data; \(2\) it enforces valid output formatting; and \(3\) it establishes a foundational capability for peak localization before the subsequent Reinforcement Learning stage\.

#### 3\.4\.2\.Stage 2: Reinforcement Learning Optimization\.

Following Supervised Fine\-Tuning \(SFT\), the model undergoes a second stage of optimization using Reinforcement Learning \(RL\) with Group Relative Policy Optimization \(GRPO\) \(Shao et al\., [2024](https://arxiv.org/html/2605.16452#bib.bib50)\)\. As illustrated in the RL schematic \(Fig\. [6\(b\)](https://arxiv.org/html/2605.16452#S3.F6.sf2)\), this process refines the model’s policy at the sequence level through a structured, block\-by\-block progression:

Rollout and Reference\. The workflow initiates with the Policy Model \(initialized from the SFT model\) sampling a group of $G=16$ distinct outputs, or “Rollouts,” for a single input instruction\. Simultaneously, a frozen Reference Model processes the same input\. This parallel execution is critical for computing the KL\-divergence constraint, which acts as a regularization anchor to prevent the policy from drifting too far from the initial linguistic distribution established during SFT\.

Reward Evaluation\. Each generated rollout is evaluated against a multi\-objective Reward Function \($R_{\text{total}}$\)\. Rather than relying on a single criterion, our framework computes a composite score derived from four distinct criteria: Format Compliance, Detection Accuracy, Count Completeness, and Heart\-Rate Consistency\. This design ensures the model optimizes for both syntax validity and physiological fidelity\.

Group Advantage Estimation\. A core innovation of GRPO is the elimination of the separate “Critic” value model typically required in algorithms like PPO\. Training a Critic for long\-context physiological sequences is computationally expensive and prone to high variance\. Instead, GRPO uses the group average as a dynamic baseline\. The intuition is straightforward: for a given input query $q$, we compare the reward of a specific rollout $o_{i}$ against the average reward of its peers in the same group\. If $o_{i}$ performs better than the average, it yields a positive advantage \($\hat{A}_{i,t}>0$\), signaling the model to increase the probability of that sequence\.
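The group\-average baseline can be sketched in a few lines\. The reward values below are hypothetical, and the group is shortened to four rollouts for readability\. The GRPO formulation of Shao et al\. additionally divides by the group standard deviation; whether that normalization is applied here is not stated in this section, so the sketch keeps only the mean baseline described above\.

```python
from statistics import mean

def group_advantages(rewards):
    """Advantage of each rollout = its reward minus the group-average baseline."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical total rewards for one query (4 rollouts instead of 16 for brevity).
rewards = [0.9, 0.4, 0.7, 0.4]
advantages = group_advantages(rewards)
```

Rollouts above the group mean receive a positive advantage and those below a negative one, so the advantages within a group sum to zero\.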

The optimization objective maximizes this expected advantage while enforcing strict stability constraints:

$$
\begin{split}
\mathcal{J}_{GRPO}(\theta)=&\mathbb{E}\left[q\sim P(Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)\right]\\
&\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\left\{\underbrace{\min\left[r_{t}(\theta)\hat{A}_{i,t},\text{clip}\left(r_{t}(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_{i,t}\right]}_{\text{Trust Region Constraint}}-\underbrace{\beta_{KL}\mathbb{D}_{KL}\left[\pi_{\theta}||\pi_{ref}\right]}_{\text{Language Regularization}}\right\}
\end{split}
\tag{2}
$$

Equation [2](https://arxiv.org/html/2605.16452#S3.E2) incorporates three critical mechanisms designed to balance the stability\-plasticity dilemma:

1. Probability Ratio $r_{t}(\theta)$: Measures how much more likely an action is under the new policy compared to the old, tracking the magnitude of the update\.
2. Trust Region Clipping \($\varepsilon=0.2$\): This prevents the model from making disproportionately large updates based on a single batch\. We selected $\varepsilon=0.2$ as a conservative bound to avoid “catastrophic forgetting,” where a single noisy batch could otherwise destroy the model’s learned representations\.
3. KL\-Penalty \($\beta_{KL}=0.04$\): Physiological signal processing requires the model to strictly adhere to the syntax learned during SFT\. The term $\beta_{KL}$ acts as a “syntax anchor\.” We empirically set $\beta_{KL}=0.04$, which is sufficiently high to preserve linguistic coherence but low enough to allow the model to shift its reasoning logic toward better peak detection\.
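The three mechanisms can be combined into the per\-token term of the GRPO objective, as in the minimal sketch below\. The function name and the log\-probability inputs are hypothetical, and the KL estimator is an assumption \(the non\-negative estimator used by Shao et al\.\), since this section does not specify how the divergence is computed\.

```python
import math

EPSILON = 0.2   # clipping bound from the text
BETA_KL = 0.04  # KL weight from the text

def token_objective(logp_new, logp_old, logp_ref, advantage,
                    eps=EPSILON, beta=BETA_KL):
    """Per-token term of the clipped objective with KL penalty (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # probability ratio r_t(theta)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r_t, 1 - eps, 1 + eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Assumed KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 (always >= 0).
    log_ratio_ref = logp_ref - logp_new
    kl = math.exp(log_ratio_ref) - log_ratio_ref - 1.0
    return surrogate - beta * kl
```

With a positive advantage, a ratio of 1\.5 is capped at 1\.2, so the update gains nothing from moving further; with a negative advantage the unclipped, more pessimistic value is kept\.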

Update\. The model uses these advantages to Compute Loss and perform a policy Update, increasing the probability of rollouts that outperform their peers while ensuring the policy does not drift too far from the original SFT distribution\.

Multi\-Objective Reward Design\. Defining “success” in physiological peak detection is multifaceted\. To guide the optimization through this complex landscape, we derive a composite reward function $R_{\text{total}}$ structured as a hierarchy of clinical needs:

$$R_{\text{total}}=\alpha\cdot R_{\text{format}}+\beta\cdot R_{\text{detection}}+\gamma\cdot R_{\text{complete}}+\delta\cdot R_{\text{HR}}\tag{3}$$

The coefficients are set to $\alpha=0.1$, $\beta=0.6$, $\gamma=0.15$, and $\delta=0.15$\. These values enforce a specific learning priority: syntax is a baseline constraint \($\alpha$\), detection accuracy is the primary clinical objective \($\beta$\), while count and heart\-rate metrics \($\gamma,\delta$\) serve as physiological regularizers\.

The individual components are defined as follows:

- Format Compliance \($R_{\text{format}}$\): This is a binary reward where $R_{\text{format}}=1$ if the output adheres to the specified syntax \(Fig\. [4](https://arxiv.org/html/2605.16452#S3.F4)\) and $0$ otherwise\. We assign a low weight \($\alpha=0.1$\) because this acts as a gating mechanism\. Once the model learns the syntax \(typically early in training\), this reward saturates\. A higher weight would risk the model optimizing for valid but empty structures\.
- Detection Accuracy \($R_{\text{detection}}$\): As the primary objective, this is quantified by the $F_{1}$\-score between predicted and ground\-truth peaks using a strict 30 ms tolerance window\. The $F_{1}$\-score is chosen over simple accuracy to robustly handle class imbalance \(sparse peaks vs\. abundant background signal\)\.
- Count Completeness \($R_{\text{complete}}$\): To penalize missed or hallucinated peaks, we employ an exponential decay function based on the total peak count difference:

  $$R_{\text{complete}}=\exp\left(-|N_{\text{pred}}-N_{\text{gt}}|\right)\tag{4}$$

  We selected an exponential form rather than a linear penalty \(e\.g\., Mean Absolute Error\) to strictly bound the reward to $(0,1]$\. This design prevents gradient explosions in early training stages where the model might predict wildly incorrect counts \(e\.g\., 0 or 100 peaks\), which would otherwise destabilize the policy update\.
- Heart\-Rate Consistency \($R_{\text{HR}}$\): Physiological plausibility is assessed via the normalized error in derived heart rate:

  $$R_{\text{HR}}=\exp\left(-2\cdot\frac{|HR_{\text{pred}}-HR_{\text{gt}}|}{HR_{\text{gt}}}\right)\tag{5}$$

  The scaling factor of $-2$ was empirically chosen to sharpen the reward curve, ensuring that only high\-precision estimates \(within $\approx 5\%$ error\) yield significant positive reinforcement\. This forces the model to refine its peak localization to meet clinical\-grade heart rate standards\.
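The composite reward can be sketched directly from the component definitions above\. The function names are hypothetical, and the $F_{1}$\-score is assumed to be computed elsewhere and passed in\. How an unparseable output is scored on the remaining three terms is not specified in this section, so that decision is left to the caller\.

```python
import math

ALPHA, BETA, GAMMA, DELTA = 0.1, 0.6, 0.15, 0.15  # weights stated in the text

def r_complete(n_pred, n_gt):
    """Count completeness: exponential decay in the absolute peak-count difference."""
    return math.exp(-abs(n_pred - n_gt))

def r_hr(hr_pred, hr_gt):
    """Heart-rate consistency: exponential decay in the relative heart-rate error."""
    return math.exp(-2.0 * abs(hr_pred - hr_gt) / hr_gt)

def r_total(format_ok, f1, n_pred, n_gt, hr_pred, hr_gt):
    """Weighted sum of the four reward components."""
    r_format = 1.0 if format_ok else 0.0
    return (ALPHA * r_format + BETA * f1
            + GAMMA * r_complete(n_pred, n_gt)
            + DELTA * r_hr(hr_pred, hr_gt))
```

For scale, a 5% heart\-rate error gives $R_{\text{HR}}=e^{-0.1}\approx 0.905$, while a single missed or extra peak gives $R_{\text{complete}}=e^{-1}\approx 0.368$\.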

## 4\.EXPERIMENT SETUP AND EVALUATION

### 4\.1\.Datasets and Preprocessing

Datasets\. To evaluate the cross\-modal and explanatory capabilities of Peak\-Detector, we conducted experiments on seven physiological signal datasets: six publicly available benchmarks and one self\-collected ICU cohort\. We focus on detecting the prominent peak in each modality: the R\-peak for ECG, the systolic peak for PPG, and the J\-peak for BCG and BSG\. The ECG datasets are the MIT\-BIH Arrhythmia Database \(Moody and Mark, [1992](https://arxiv.org/html/2605.16452#bib.bib51)\) and the Incart Arrhythmia Database \(Tihonenko et al\., [2007](https://arxiv.org/html/2605.16452#bib.bib52)\)\. For PPG analysis, we use the BIDMC Database \(Pimentel et al\., [2016](https://arxiv.org/html/2605.16452#bib.bib53)\) and the Capnobase Database \(Karlen et al\., [2013](https://arxiv.org/html/2605.16452#bib.bib54)\)\. The BCG experiments were performed on the Kansas Database \(Carlson et al\., [2020](https://arxiv.org/html/2605.16452#bib.bib55)\) and the BCG Arrhythmia Database \(Zhan et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib56)\)\. Finally, the BSG experiments use our self\-collected ICU database\. Detailed descriptions of each dataset are provided in Appendix [A](https://arxiv.org/html/2605.16452#A1)\.

Signal Preprocessing\. Prior to analysis, all physiological signals undergo a standardized preprocessing pipeline to ensure consistency and mitigate noise\. Each continuous recording is first segmented into fixed\-length windows of 1000 samples\. A fourth\-order Butterworth bandpass filter is then applied, with a passband frequency ranging from 0\.6 Hz to 15 Hz\. This step effectively removes baseline wander and high\-frequency noise while preserving the characteristic morphology of cardiac events\. Following filtration, a z\-score normalization is performed on each segment, transforming the signal to have a mean of zero and a standard deviation of one\. This standardization mitigates amplitude variations across different subjects and recording sessions, thereby enhancing the robustness of the peak detection algorithms\.
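A minimal sketch of this pipeline using NumPy and SciPy follows\. Several details are assumptions rather than the authors' implementation: the sampling rate `fs` is not given in this section, the incomplete final window is dropped, filtering is zero\-phase via `sosfiltfilt`, and the `order=4` argument is passed directly to `butter` \(for a band\-pass design SciPy doubles the resulting filter order\)\.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(recording, fs, window=1000, low_hz=0.6, high_hz=15.0, order=4):
    """Segment, band-pass filter, and z-score a 1-D recording."""
    recording = np.asarray(recording, dtype=float)
    n_windows = len(recording) // window  # assumption: drop the incomplete tail
    segments = recording[: n_windows * window].reshape(n_windows, window)
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, segments, axis=1)  # assumption: zero-phase filtering
    mean = filtered.mean(axis=1, keepdims=True)
    std = filtered.std(axis=1, keepdims=True)
    return (filtered - mean) / std  # a perfectly flat segment would divide by zero
```

Each row of the returned array is one 1000\-sample segment with zero mean and unit standard deviation\.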

Subject\-Independent Cross\-Validation\. To assess the robustness and generalizability of our model, we implemented a 4\-fold subject\-independent cross\-validation scheme\. For each dataset, subjects were randomly partitioned into four disjoint subsets of approximately equal size\. The experimental procedure was conducted iteratively: in each iteration, three folds constituted the training set, while the remaining fold served as the unseen testing set for the learning\-based models\. Signal\-processing baselines, which do not require training, were evaluated directly on the corresponding test folds to ensure comparable evaluation conditions\. To demonstrate model stability, final performance metrics are reported as the mean $\pm$ standard deviation across the four folds\. Furthermore, to ascertain statistical significance, we performed a two\-tailed Welch’s $t$\-test \(West, [2021](https://arxiv.org/html/2605.16452#bib.bib88)\) comparing the proposed Peak\-Detector against the best\-performing baseline model \(defined by the lowest MAE\)\.
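The subject\-level partitioning can be sketched as follows\. The function names, the fixed seed, and the round\-robin assignment of shuffled subjects to folds are illustrative choices, not details taken from the paper; the essential property is that no subject appears in both the training and the test set of a given round\.

```python
import random

def subject_folds(subject_ids, n_folds=4, seed=0):
    """Randomly partition the unique subjects into disjoint, near-equal folds."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::n_folds] for i in range(n_folds)]

def train_test_splits(subject_ids, n_folds=4, seed=0):
    """Yield (train_subjects, test_subjects) for each cross-validation round."""
    folds = subject_folds(subject_ids, n_folds, seed)
    for k, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, test
```

Segments are then assigned to the training or test set according to the subject they came from, rather than being shuffled individually\.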

### 4\.2\.Evaluation Metrics

To assess the performance of Peak\-Detector, we employ evaluation metrics that capture both the accuracy of peak localization and the physiological consistency of the derived cardiac rhythm\. The primary accuracy metrics are the F1\-score, Precision, and Recall\. Following established conventions in physiological signal processing \(Reiss et al\., [2019](https://arxiv.org/html/2605.16452#bib.bib57); Zuo et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib58)\), a predicted peak is considered a True Positive \(TP\) if it falls within a tolerance radius of $\pm 30$ ms around a ground\-truth peak, which is smaller than commonly used thresholds \(Zuo et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib58); Reiss et al\., [2019](https://arxiv.org/html/2605.16452#bib.bib57); Elgendi et al\., [2013](https://arxiv.org/html/2605.16452#bib.bib12)\)\. This strict window ensures that only highly precise detections contribute positively to the F1\-score\. We also evaluate the model’s ability to accurately derive two key parameters: heart rate \(HR\) and heart rate variability \(HRV\)\. These are calculated from the sequence of detected peak\-to\-peak intervals \(e\.g\., R\-R, systolic\-systolic, J\-J\), and we report the Mean Absolute Error \(MAE\) against ground\-truth values in beats per minute \(bpm\) for HR and milliseconds \(ms\) for HRV\. To mitigate the bias of fixed temporal constraints, we implemented an adaptive tolerance mechanism \(Appendix [H](https://arxiv.org/html/2605.16452#A8)\) using a window of $\pm 5\%$ of the local Inter\-Beat Interval \(IBI\)\.
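The tolerance\-based matching can be sketched as below\. The greedy one\-to\-one matching and the definition of the local IBI \(the interval to the preceding ground\-truth beat\) are assumptions for illustration; the exact adaptive procedure is the one described in the appendix\.

```python
def match_peaks(pred_ms, gt_ms, tol_ms=30.0, adaptive_frac=None):
    """Greedy one-to-one matching of predicted to ground-truth peak times (ms).

    With adaptive_frac set (e.g. 0.05), the tolerance for each ground-truth peak
    is that fraction of its local inter-beat interval instead of tol_ms.
    """
    pred, gt = sorted(pred_ms), sorted(gt_ms)
    used, tp = set(), 0
    for i, g in enumerate(gt):
        tol = tol_ms
        if adaptive_frac is not None and len(gt) > 1:
            # Assumed local IBI: interval to the previous beat (next beat for the first).
            ibi = gt[i] - gt[i - 1] if i > 0 else gt[1] - gt[0]
            tol = adaptive_frac * ibi
        candidates = [(abs(p - g), j) for j, p in enumerate(pred)
                      if j not in used and abs(p - g) <= tol]
        if candidates:
            used.add(min(candidates)[1])  # closest unused prediction
            tp += 1
    fp, fn = len(pred) - tp, len(gt) - tp
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Each prediction can satisfy at most one ground\-truth peak, so duplicate detections around the same beat count as false positives\.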

### 4\.3\.Baselines

To benchmark the performance of Peak\-Detector, we compared our model against a diverse set of nine state\-of\-the\-art baselines\. This selection encompasses both modality\-specific signal\-processing approaches and generalizable deep learning models\. Detailed descriptions of each methodology are provided in Appendix [B](https://arxiv.org/html/2605.16452#A2)\. Details of the experimental setup for Peak\-Detector are documented in Appendix [C](https://arxiv.org/html/2605.16452#A3)\. All baseline methods were evaluated using identical preprocessing pipelines and matching tolerance criteria \($\pm 30$ ms fixed window\) to ensure fair comparison\.

The signal\-processing baselines leverage domain\-specific heuristics and include the Pan\-Tompkins algorithm \(Fariha et al\., [2020](https://arxiv.org/html/2605.16452#bib.bib26)\) and Nabian’s method \(Nabian et al\., [2018](https://arxiv.org/html/2605.16452#bib.bib59)\) for ECG; Elgendi’s algorithm \(Elgendi, [2012](https://arxiv.org/html/2605.16452#bib.bib60)\) and the multi\-scale approach by Bishop \(Bishop and Ercole, [2018](https://arxiv.org/html/2605.16452#bib.bib61)\) for PPG; and Pino’s method \(Pino et al\., [2017](https://arxiv.org/html/2605.16452#bib.bib62)\) alongside the segmentation\-based algorithm by Choi \(Choi et al\., [2009](https://arxiv.org/html/2605.16452#bib.bib63)\) for BCG and BSG\. For these methods, the algorithms were applied directly to the preprocessed signals, and the detected peaks were compared against the ground truth to compute evaluation metrics\.

The deep learning baselines represent advanced data\-driven approaches, including CNN\-SWT \(Yun et al\., [2022](https://arxiv.org/html/2605.16452#bib.bib64)\) for ECG, 1D\-UNet\+\+ \(Zhou et al\., [2021](https://arxiv.org/html/2605.16452#bib.bib65)\) for BCG, and FR\-Net \(Chen et al\., [2023c](https://arxiv.org/html/2605.16452#bib.bib66)\), a specialized CNN\-Transformer hybrid network developed for fetal R\-peak detection\. In contrast to the heuristic methods, these deep learning models output a probability distribution of peak locations rather than discrete coordinates\. Consequently, we employed a local extrema search \(Virtanen et al\., [2020](https://arxiv.org/html/2605.16452#bib.bib48)\) to extract the final peak positions from the generated probability maps\.
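One way to perform such a local extrema search is with SciPy's `find_peaks`, as in the sketch below\. The probability threshold `min_prob` and the minimum spacing `min_distance` are hypothetical parameters chosen for illustration; the values used for the baselines are not given in this section\.

```python
import numpy as np
from scipy.signal import find_peaks

def peaks_from_probability(prob_map, min_prob=0.5, min_distance=25):
    """Return indices of local maxima in a per-sample peak-probability trace."""
    indices, _ = find_peaks(np.asarray(prob_map, dtype=float),
                            height=min_prob, distance=min_distance)
    return indices
```

The `height` filter discards low\-confidence bumps, and `distance` suppresses multiple detections within one beat\.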

### 4\.4\.Results

Table 1\. Peak Detection Performance Comparison across Signal Modalities and Baselines\. Lower values are better for MAE and MAPE \(HR \(bpm\), HRV \(ms\)\); higher values are better for F1, Pre, and Rec\. The best performance in each metric \(excluding Pre/Rec\) is bold; the second best is underlined\. Standard deviations are shown in a smaller font\. Superscripts on Peak\-Detector indicate significance against the second\-best method: \*$p<0.05$, \*\*$p<0.01$\. The $p$\-value column is retained for reference\.

This section presents a comprehensive evaluation of Peak\-Detector’s performance against established baselines across the seven physiological datasets\. The quantitative results, encompassing peak detection accuracy \(F1\-score, Precision, Recall\) and physiological consistency \(HR MAE, HRV MAE\), are summarized in Table [1](https://arxiv.org/html/2605.16452#S4.T1)\. Comprehensive ablation studies, which evaluate the influence of specific training stages and benchmark against state\-of\-the\-art online LLMs, are detailed in Appendix [G](https://arxiv.org/html/2605.16452#A7)\.

Traditional signal\-processing algorithms demonstrate highly variable performance, often exhibiting significant degradation when applied outside their target modality\. For instance, Elgendi’s algorithm, while effective for PPG, fails to generalize to BCG datasets, confirming the limitations of methods reliant on hand\-crafted, modality\-specific features\. Conversely, deep learning baselines \(e\.g\., FR\-Net, 1D\-UNet\+\+\) demonstrate stronger generalization; however, they remain opaque “black\-boxes,” lacking the interpretability required for clinical trust\.

Our proposed Peak\-Detector addresses these challenges, demonstrating performance that is either state\-of\-the\-art or statistically comparable to specialized deep learning models across modalities:

Electrocardiography \(ECG\): On the MIT\-BIH and Incart datasets, Peak\-Detector maintains robust detection capabilities\. While specialized deep learning baselines like FR\-Net achieved statistically lower error rates \($p<0.001$\) on these cleaner signals, Peak\-Detector’s performance remains highly competitive for clinical utility\. This suggests that while specialized architectures may offer marginal gains in low\-noise environments, our generative approach provides a viable, high\-performance alternative without requiring architecture\-specific tuning\. We provide a further analysis of failure cases for this dataset in Appendix [F](https://arxiv.org/html/2605.16452#A6)\.

Photoplethysmography \(PPG\): In the optical domain, Peak\-Detector performs strongly\. Notably, on the BIDMC dataset, it achieves state\-of\-the\-art performance with a statistically significant reduction in HR MAE \(0\.35 bpm vs\. 0\.71 bpm, $p<0.01$\) compared to the best baseline\. On the Capnobase dataset, our model similarly achieves the lowest mean HR MAE \(0\.35 bpm\) and highest F1\-score \(0\.9920\)\. Although the difference on Capnobase did not reach statistical significance \($p>0.05$\), the results indicate that Peak\-Detector is at least on par with the top\-performing specialized models in the PPG domain\.

Ballistocardiography \(BCG\) and Bodyseismography \(BSG\): On the challenging mechanical signals \(Kansas, Arrhy, BSG ICU\), Peak\-Detector demonstrates its strongest advantage\. For the Arrhythmia dataset, our model achieved a statistically significant improvement in HRV MAE \(36\.85 ms vs\. 57\.44 ms, $p<0.05$\) against the best baseline, highlighting its superior ability to track heart rate variability in irregular rhythms\. On the large\-scale BSG ICU dataset, Peak\-Detector achieved the lowest mean HR MAE \(8\.17 bpm\), numerically outperforming the top deep learning baseline \(CNN\-SWT: 8\.43 bpm\)\. While this specific difference was not statistically significant \($p>0.05$\), the consistent top\-tier performance across these mechanical modalities indicates that Peak\-Detector matches the stability of specialized architectures in high\-noise environments while offering the distinct advantage of explainability\.

To evaluate the generalization capabilities and robustness of our framework beyond single\-source data, we performed a series of cross\-dataset experiments\. Detailed results are documented in Appendix [L](https://arxiv.org/html/2605.16452#A12)\.

### 4\.5\.Qualitative Analysis

To provide a qualitative assessment of performance in challenging scenarios, Figure [7](https://arxiv.org/html/2605.16452#S4.F7) illustrates the comparative detection results on representative segments of ECG, PPG, BCG, and BSG signals\. This visual analysis highlights the practical failure modes of baseline algorithms and underscores the robustness of our proposed method\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/ECG_R_peak.png)\(a\) ECG R\-Peak Detection
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/PPG_peak.png)\(b\) PPG Systolic Peak Detection
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/BCG_J_peak.png)\(c\) BCG J\-Peak Detection
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/BSG_J_peak.png)\(d\) BSG J\-Peak Detection

Figure 7\. Qualitative visualization of peak detection performance across challenging segments of \(a\) ECG with T\-wave interference, \(b\) PPG with noise artifacts, \(c\) BCG with arrhythmia, and \(d\) BSG in the ICU\.

The ECG example \(Figure [7\(a\)](https://arxiv.org/html/2605.16452#S4.F7.sf1)\) features a prominent, extended T\-wave, a common source of error for heuristic\-based methods\. As observed, several signal\-processing algorithms \(Pan\-Tompkins, Bishop, Pino, Choi\) erroneously classify this T\-wave as a true R\-peak due to its significant amplitude\. Other methods, like Elgendi, correctly identify the R\-peak’s general location but exhibit notable positional inaccuracies\. While the deep learning baselines \(FR\-Net, CNN\-SWT\) are more robust to the morphological interference, they still show slight deviations from the precise R\-peak location, a challenge that Peak\-Detector overcomes\.

The PPG segment \(Figure [7\(b\)](https://arxiv.org/html/2605.16452#S4.F7.sf2)\) demonstrates the impact of signal quality degradation\. In the initial clean portion of the signal, all methods perform reasonably well\. However, as the signal becomes corrupted by noise in the latter half, the performance of signal\-processing methods deteriorates significantly; Pan\-Tompkins fails to detect subsequent peaks entirely, and most other heuristic methods miss the crucial peak around the 700\-sample mark\. While the data\-driven approaches maintain better performance in the noisy region, Peak\-Detector is the only method to correctly identify the systolic peak within a morphologically distorted area around the 600\-sample mark, showcasing its superior resilience to artifacts\.

The BCG arrhythmia example \(Figure [7\(c\)](https://arxiv.org/html/2605.16452#S4.F7.sf3)\) highlights the difficulty of analyzing signals with irregular timing and morphology\. Traditional algorithms, relying on fixed assumptions about rhythm and shape, struggle significantly; Pan\-Tompkins and Elgendi exhibit considerable over\-detection, while Bishop and Pino show incorrect localization\. In contrast, data\-driven approaches, and particularly Peak\-Detector, successfully identify the correct J\-peak positions even amidst arrhythmia and reduced signal clarity\. These qualitative observations across diverse modalities reinforce the quantitative findings, illustrating that Peak\-Detector’s contextual reasoning provides superior robustness against morphological anomalies, noise, and rhythm disturbances\.

Finally, the BSG ICU example \(Figure [7\(d\)](https://arxiv.org/html/2605.16452#S4.F7.sf4)\) illustrates performance on a highly challenging BSG signal, characterized by the elevated noise levels typical of an ICU environment\. As observed, traditional algorithms are prone to over\-detection, while deep learning models tend towards under\-detection in this scenario\. Peak\-Detector, however, correctly identifies all ground\-truth peaks, demonstrating that its contextual reasoning is highly effective at distinguishing true cardiac events from artifacts, even under conditions of both high noise and arrhythmic variability\.

### 4\.6\.Explanation Evaluation

Evaluating the quality of machine\-generated explanations is a nuanced challenge\. To provide a holistic assessment, we designed a multi\-faceted evaluation framework that assesses the generated explanations across five key dimensions\. The results of this evaluation, summarized in the radar plot in Figure [8](https://arxiv.org/html/2605.16452#S4.F8), demonstrate the high quality and reliability of Peak\-Detector’s explanatory capabilities\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/radar_plot.png)

Figure 8\. Radar Plot of Explanation Evaluation\.

The evaluation framework assesses Peak\-Detector’s interpretability across five key dimensions\. Faithfulness and Robustness quantify the alignment between qualitative explanations and quantitative signal features, specifically verifying assertions against input data and testing stability under perturbations like clipping or noise; on both metrics, the model achieved a score of 98\.1, indicating that its reasoning is grounded in the data and resilient to signal variations\. Clarity and Utility were evaluated via user studies to measure intelligibility and practical efficacy, with both receiving scores of 88, supporting the framework’s accessibility and its ability to bolster clinician confidence\. Finally, Completeness, measured by the inclusion of morphological, temporal, amplitude, and contextual analyses, scored 96\.3, highlighting the model’s robust instruction\-following capabilities and its ability to generate comprehensive, multi\-faceted rationales\.

Analysis of Generated Explanations\. Table [13](https://arxiv.org/html/2605.16452#A15.T13) presents a side\-by\-side comparison between an explanation generated by the “teacher” LLM and one produced by our fine\-tuned Peak\-Detector, revealing several key insights into our model’s interpretability\. The analysis confirms that Peak\-Detector demonstrates a strong adherence to the predefined explanatory format, consistently integrating multiple analytical dimensions into its reasoning\. These dimensions include morphological characteristics \(peak shape\), temporal relationships \(physiological intervals\), signal quality criteria \(signal\-to\-noise ratio and amplitude\), and waveform context\. Furthermore, the model’s textual rationale is faithful to the underlying signal data, as the provided justification directly supports the final peak selections, showing a strong alignment between its analysis and prediction\. Finally, a notable trade\-off between conciseness and verbosity is observed; the explanation from Peak\-Detector is more succinct than that of the larger teacher model, while still effectively communicating the core logic for its decisions\.

Emergent Analytical Capabilities\. A significant advantage of leveraging a pre\-trained LLM is its potential to generalize its learned knowledge beyond the explicit fine\-tuning task\. To explore this, we evaluated Peak\-Detector on two zero\-shot tasks not present in the training data: fine\-grained peak classification and procedural heart rate calculation\. The results, shown in Figure [9](https://arxiv.org/html/2605.16452#S4.F9), demonstrate that our training cultivates a model with a foundational understanding of the underlying signal, rather than just a narrow peak detection function\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/Interaction.png)\(a\) Zero\-Shot Peak Classification
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/interpretation/Interaction2.png)\(b\) Zero\-Shot HR Calculation

Figure 9\. Demonstration of Peak\-Detector’s emergent analytical capabilities on zero\-shot tasks\. \(a\) The model correctly classifies a non\-J\-peak and hypothesizes its identity\. \(b\) The model executes a multi\-step procedure to accurately calculate heart rate from the detected peaks\.

As shown in Figure [9\(a\)](https://arxiv.org/html/2605.16452#S4.F9.sf1), when queried to classify a specific point in the waveform, the model correctly identifies it as a non\-J\-peak\. Critically, it goes further by hypothesizing that the point may correspond to a K\-wave\. This demonstrates a latent understanding of the complete IJK complex morphology\. Furthermore, Figure [9\(b\)](https://arxiv.org/html/2605.16452#S4.F9.sf2) showcases the model’s capacity for multi\-step procedural reasoning\. When tasked with calculating the heart rate, Peak\-Detector spontaneously executes a correct analytical workflow: it first identifies all J\-peaks, then calculates the individual peak\-to\-peak intervals, and finally averages these intervals to derive the mean heart rate\.
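The three\-step heart\-rate procedure described above corresponds to the following small computation\. The function name and the sampling\-rate parameter `fs` are hypothetical, and the peak indices are assumed to be sample positions of already detected J\-peaks\.

```python
def mean_heart_rate_bpm(peak_indices, fs):
    """Heart rate from peak sample indices: intervals -> mean interval -> bpm."""
    if len(peak_indices) < 2:
        raise ValueError("need at least two peaks to form an interval")
    peaks = sorted(peak_indices)
    intervals_s = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    mean_interval_s = sum(intervals_s) / len(intervals_s)
    return 60.0 / mean_interval_s
```

For example, peaks spaced exactly one second apart yield 60 bpm, and peaks half a second apart yield 120 bpm\.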

These emergent capabilities highlight that our framework does not merely produce a static peak detector\. Instead, it cultivates an interactive analytical tool with a deeper, more flexible understanding of cardiac signals, showcasing the broader potential for LLMs to serve as versatile partners in physiological data analysis\.

## 5\.ANALYSIS

### 5\.1\.Validation of Peak Representation

To validate the Peak Representation framework, we evaluated its performance across three critical metrics: data compression efficiency, signal reconstruction fidelity, and peak set completeness\. Efficiency was quantified via a Data Retention Ratio, defined as the cardinality of the Peak Representation relative to the original signal length\. As illustrated in Figure [10](https://arxiv.org/html/2605.16452#S5.F10)\(a\), the framework achieves substantial compression across modalities, with mean retention ratios of 13\.2% for ECG and 11\.3% for BCG; notably, the smoother PPG waveforms required as little as 1\.2% of the original data\. To ensure information preservation, we assessed reconstruction fidelity using cubic spline interpolation\. The resulting Pearson correlation coefficients \(Figure [10](https://arxiv.org/html/2605.16452#S5.F10)\(b\)\) consistently exceeded 0\.90 across all datasets, confirming the retention of vital temporal dynamics\. Finally, we verified the completeness of the initial extraction via Prominent Peak Recall\. As shown in Figure [10](https://arxiv.org/html/2605.16452#S5.F10)\(c\), the framework achieved a near\-perfect recall \(minimum 0\.9956\), ensuring that nearly all ground\-truth peaks are captured in the candidate set\. This transforms the detection task into a high\-confidence filtering and reasoning challenge for the LLM\. Detailed tabulations are provided in Section [5\.2](https://arxiv.org/html/2605.16452#S5.SS2)\.
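The three validation metrics can be sketched as follows\. The candidate extraction used here \(all local maxima and minima plus the two endpoints\) is an assumed stand\-in for the Peak Representation, whose exact construction is defined earlier in the paper; the function names and the `tol` parameter are hypothetical\.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def candidate_indices(signal):
    """Local maxima, local minima, and both endpoints (assumed representation)."""
    signal = np.asarray(signal, dtype=float)
    maxima = argrelextrema(signal, np.greater)[0]
    minima = argrelextrema(signal, np.less)[0]
    return np.unique(np.concatenate(([0], maxima, minima, [len(signal) - 1])))

def validate_representation(signal, gt_peaks, tol=0):
    """Return (retention ratio, Pearson r of the spline reconstruction, recall)."""
    signal = np.asarray(signal, dtype=float)
    idx = candidate_indices(signal)
    retention = len(idx) / len(signal)
    reconstruction = CubicSpline(idx, signal[idx])(np.arange(len(signal)))
    pearson_r = np.corrcoef(signal, reconstruction)[0, 1]
    # A ground-truth peak is recalled if a candidate lies within tol samples of it.
    hits = sum(np.min(np.abs(idx - g)) <= tol for g in gt_peaks)
    return retention, pearson_r, hits / len(gt_peaks)
```

A smooth waveform keeps only a few extrema per cycle, which is consistent with the low retention ratio reported for PPG\.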

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/compression.png)\(a\) Compression Ratio
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/compress.png)\(b\) Correlation Coefficient
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/recall_bar.png)\(c\) Preliminary Peak Recall

Figure 10\. Validation of the Peak Representation\. \(a\) Data Retention Ratio, showing the percentage of data points remaining after transformation\. \(b\) Pearson correlation between original and reconstructed signals, demonstrating high signal fidelity\. \(c\) Recall of prominent peaks in the initial candidate set, confirming completeness\.

Figure [11](https://arxiv.org/html/2605.16452#S5.F11) qualitatively illustrates the Peak Representation by comparing original physiological signals with reconstructions derived from sparse peak data\. For ECG signals \(Fig\. [11\(a\)](https://arxiv.org/html/2605.16452#S5.F11.sf1)\), the reconstruction accurately tracks the high\-frequency QRS complex while capturing subtle baseline fluctuations, accounting for the modality’s moderate compression ratio\. Conversely, the PPG waveform \(Fig\. [11\(b\)](https://arxiv.org/html/2605.16452#S5.F11.sf2)\), characterized by smoother morphology and information density concentrated at systolic and diastolic extrema, achieves a superior compression ratio with high visual fidelity\. The BCG signal \(Fig\. [11\(c\)](https://arxiv.org/html/2605.16452#S5.F11.sf3)\) represents an intermediate complexity; the representation effectively preserves the multi\-component inflection points \(e\.g\., I, J, K waves\) essential to its contour\. These visualizations indicate that the framework retains critical morphological characteristics across diverse signal types, establishing a robust foundation for LLM\-based contextual analysis\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/ECG_reconstruction.png)\(a\) ECG Reconstruction
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/PPG_Reconstruction.png)\(b\) PPG Reconstruction
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/BCG_reconstruction.png)\(c\) BCG Reconstruction

Figure 11\. Qualitative comparison of original signals \(blue\) and their reconstructions \(orange\) from the sparse Peak Representation\. The visualizations demonstrate high fidelity across modalities: \(a\) An ECG segment from MIT\-BIH, where high\-frequency components are captured\. \(b\) A PPG segment from BIDMC, showing excellent approximation from only major peaks and troughs\. \(c\) A BCG segment from the Kansas dataset, where the complex multi\-phasic waveform is effectively outlined\.

To isolate the contribution of the LLM’s sequential reasoning capabilities, we performed a comparative analysis against established benchmarks\. Using the raw features from the Peak Representation \(e\.g\., peak index and amplitude\), we trained several traditional machine learning classifiers: Random Forest \(Rigatti, [2017](https://arxiv.org/html/2605.16452#bib.bib70)\), Logistic Regression \(LaValley, [2008](https://arxiv.org/html/2605.16452#bib.bib71)\), and XGBoost \(Chen and Guestrin, [2016](https://arxiv.org/html/2605.16452#bib.bib72)\)\. These models were tasked with the same binary classification objective: distinguishing true prominent peaks from candidate noise or artifacts\. This experimental design ensures that any performance gains observed in our framework can be specifically attributed to the LLM’s contextual processing rather than the input features themselves\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/compression_analysis/comparison.png)

Figure 12\. F1\-score comparison between Peak\-Detector and traditional machine learning classifiers trained on features from the Peak Representation\. The performance gap is minimal for simpler PPG signals but substantial for morphologically complex ECG and BCG signals, highlighting the value of the LLM’s sequential reasoning\.

Results in Figure [12](https://arxiv.org/html/2605.16452#S5.F12) illustrate a performance gap dictated by signal modality\. For morphologically stable PPG signals \(BIDMC, Capnobase\), traditional classifiers perform competitively, suggesting that local features within the Peak Representation suffice for systolic peak identification\. Conversely, Peak\-Detector significantly outperforms baselines on complex ECG and BCG signals\. This divergence underscores that resolving R\-peaks from T\-waves, or J\-peaks within an IJK complex, requires a holistic understanding of sequential patterns and morphological context\. Ultimately, these findings confirm that the framework’s primary advantage is the LLM’s unique capacity for contextual reasoning, which is indispensable for interpreting high\-complexity physiological waveforms\.

### 5\.2\.Sensitivity Analysis of Peak Representation Parameters

The effectiveness of our Peak Representation is inherently linked to the minimum horizontal distance parameter used in the initial peak extraction\. To understand this trade\-off between data compression and signal fidelity, we conducted a sensitivity analysis by varying this distance, with the results presented in Figure [13](https://arxiv.org/html/2605.16452#S5.F13)\. The analysis reveals a clear inverse relationship: increasing the distance improves compression \(i\.e\., lowers the data retention ratio\) but degrades signal reconstruction fidelity, as evidenced by rising MAE and RMSE values and a corresponding drop in correlation\. Crucially, the prominent peak recall remains consistently high, confirming that the essential target peaks are robustly captured regardless of the compression level\. The impact of this trade\-off varies significantly by modality; the smoother PPG signals are least affected, while the complex, multi\-phasic BCG signals are highly sensitive, showing a sharp drop in correlation from 0\.9582 to 0\.4616\. Based on this analysis, we selected a final parameter value that ensures reconstruction correlation remains above 0\.90, striking an effective balance between substantial data reduction and the preservation of critical information for the LLM\.
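The trade-off described above can be reproduced in miniature. The following is a minimal sketch, not the paper's implementation: it assumes the representation keeps local maxima and minima subject to a minimum index spacing, and that reconstruction is done by linear interpolation between the kept points. The function names (`local_extrema`, `reconstruct`, `fidelity`, `sweep`) and the toy signal are illustrative assumptions.

```python
import math

def local_extrema(signal, min_distance=0):
    """Indices of local maxima/minima plus both endpoints.

    An extremum is skipped when it lies within `min_distance` samples of the
    previously kept index (a simplified minimum-horizontal-distance rule).
    """
    kept = [0]
    for i in range(1, len(signal) - 1):
        is_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        is_min = signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]
        if (is_max or is_min) and i - kept[-1] > min_distance:
            kept.append(i)
    if kept[-1] != len(signal) - 1:
        kept.append(len(signal) - 1)
    return kept

def reconstruct(signal, indices):
    """Rebuild a full-length signal by linear interpolation between kept points."""
    rebuilt = [0.0] * len(signal)
    for a, b in zip(indices, indices[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            rebuilt[i] = signal[a] + t * (signal[b] - signal[a])
    return rebuilt

def fidelity(original, rebuilt):
    """Return (MAE, RMSE, Pearson correlation) between two equal-length signals."""
    n = len(original)
    mae = sum(abs(x - y) for x, y in zip(original, rebuilt)) / n
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(original, rebuilt)) / n)
    mx, my = sum(original) / n, sum(rebuilt) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(original, rebuilt))
    sx = math.sqrt(sum((x - mx) ** 2 for x in original))
    sy = math.sqrt(sum((y - my) ** 2 for y in rebuilt))
    return mae, rmse, cov / (sx * sy)

def sweep(signal, distances):
    """Compute retention ratio and fidelity metrics for each distance value."""
    rows = []
    for d in distances:
        idx = local_extrema(signal, d)
        mae, rmse, corr = fidelity(signal, reconstruct(signal, idx))
        rows.append({"distance": d, "retention": len(idx) / len(signal),
                     "mae": mae, "rmse": rmse, "corr": corr})
    return rows

# Toy quasi-periodic signal (synthetic, not physiological data).
toy = [math.sin(2 * math.pi * i / 100) + 0.3 * math.sin(2 * math.pi * i / 17)
       for i in range(500)]
for row in sweep(toy, [0, 2, 5, 10]):
    print(row)
```

On this toy signal the retention ratio falls as the distance grows, which mirrors the direction of the trend reported above; the absolute numbers are not comparable to the paper's figures.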

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/compression_ratio.png)\(a\) Data Retention Ratio
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/MAE.png)\(b\) MAE
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/RMSE.png)\(c\) RMSE
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/Correlation_vs_length.png)\(d\) Correlation
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/recall_distance.png)\(e\) Peak Recall

Figure 13\. Sensitivity analysis of Peak Representation metrics as a function of the minimum horizontal distance parameter in peak extraction\. As distance increases, data retention decreases, while MAE, RMSE, correlation and prominent peak recall vary, demonstrating the trade\-off between compression and signal fidelity across different physiological modalities\.

To evaluate the impact of the Peak Representation parameter on the performance of Peak\-Detector, we varied the minimum distance parameter over the set \{0, 2, 5, 10\} using the Arrhythmia Dataset\. This modulation resulted in average peak counts of 1,000, 209, 162, and 130, respectively\. We evaluated the detection accuracy using Heart Rate \(HR\) Mean Absolute Error \(MAE\) and Heart Rate Variability \(HRV\) MAE, as shown in Fig\. [14](https://arxiv.org/html/2605.16452#S5.F14)\. The results indicate that performance consistently improves as the average number of detected peaks decreases\. This trend highlights the benefit of primarily excluding noisy peaks to enhance signal fidelity\.
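To make the HR and HRV error metrics concrete, the sketch below derives both quantities from peak indices and compares a prediction against a reference. It is illustrative only: this section does not state which HRV statistic is used, so SDNN (the standard deviation of inter-beat intervals, in milliseconds) is assumed here, and the sampling rate and peak positions are made up.

```python
import statistics

def ibi_ms(peak_indices, fs):
    """Inter-beat intervals in milliseconds from sorted peak sample indices."""
    return [(b - a) * 1000.0 / fs for a, b in zip(peak_indices, peak_indices[1:])]

def hr_bpm(peak_indices, fs):
    """Mean heart rate in beats per minute."""
    return 60000.0 / statistics.mean(ibi_ms(peak_indices, fs))

def hrv_sdnn_ms(peak_indices, fs):
    """HRV as SDNN in ms (an assumed choice of HRV statistic)."""
    return statistics.pstdev(ibi_ms(peak_indices, fs))

def mae(predicted, reference):
    """Mean absolute error over paired per-segment values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

fs = 100  # Hz, hypothetical sampling rate
reference = [0, 100, 200, 300, 400]        # perfectly regular 60 bpm rhythm
predicted = [0, 100, 150, 200, 300, 400]   # same rhythm plus one spurious peak

hr_error = mae([hr_bpm(predicted, fs)], [hr_bpm(reference, fs)])
hrv_error = mae([hrv_sdnn_ms(predicted, fs)], [hrv_sdnn_ms(reference, fs)])
print(f"HR MAE:  {hr_error:.2f} bpm")
print(f"HRV MAE: {hrv_error:.2f} ms")
```

A single spurious peak in this toy segment shifts the HR estimate by 15 bpm and moves SDNN from 0 ms to roughly 245 ms, which illustrates why filtering noisy candidate peaks matters more for HRV than for HR.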

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/scaling_plot_peaks.png)

Figure 14\. Impact of the distance parameter on detection error\. The plot demonstrates the trade\-off between the average number of peaks retained and the resulting MAE for HR and HRV\.

### 5\.3\.Impact of Model Scale

To investigate the relationship between model capacity and task performance, we conducted a scaling analysis by fine\-tuning several variants of our base LLM architecture, with parameter counts ranging from 0\.5 billion to 7 billion\. The performance of each model variant was evaluated on the challenging BCG Arrhythmia Dataset\.

As depicted in Figure [15](https://arxiv.org/html/2605.16452#S5.F15), the results demonstrate a clear and consistent scaling law: performance across all metrics improves monotonically with the number of model parameters\. Specifically, as the model size increased from 0\.5B to 7B, the HR MAE decreased substantially from 3\.26 to 0\.59, and the HRV MAE dropped from 83\.85 to 9\.91\. Concurrently, the F1\-score rose from 0\.8150 to 0\.9701, driven by marked improvements in both precision \(from 0\.8210 to 0\.9678\) and recall \(from 0\.8091 to 0\.9725\)\.

This strong positive correlation suggests that larger models possess a greater capacity to learn the complex, non\-linear patterns and subtle morphological cues inherent in arrhythmic physiological signals\. The concurrent improvement in both detection accuracy \(F1\-score\) and physiological consistency \(HR/HRV MAE\) indicates that increased model scale enhances not just pattern matching but also a deeper, more context\-aware understanding of the underlying cardiac rhythm\. This scaling trend suggests that the performance of the Peak\-Detector framework is not yet saturated and could be further enhanced by leveraging even larger foundational models, affirming that the principles of scaling laws extend effectively to this specialized domain of time\-series analysis\. In contrast, the deep neural networks under evaluation fail to demonstrate performance gains commensurate with model scaling \(see Appendix [I](https://arxiv.org/html/2605.16452#A9) for a detailed scaling analysis\)\.
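The detection metrics in this analysis depend on how predicted peaks are matched to reference peaks. The sketch below shows one common way to do this: greedy one-to-one matching within a temporal tolerance, followed by precision, recall, and F1. The matching rule and the tolerance value are assumptions for illustration, since this section does not specify the evaluation protocol.

```python
def match_peaks(predicted, reference, tol):
    """Count one-to-one matches between peak indices within +/- tol samples."""
    predicted, reference = sorted(predicted), sorted(reference)
    i = j = true_positives = 0
    while i < len(predicted) and j < len(reference):
        if abs(predicted[i] - reference[j]) <= tol:
            true_positives += 1
            i += 1
            j += 1
        elif predicted[i] < reference[j]:
            i += 1  # unmatched prediction (false positive)
        else:
            j += 1  # unmatched reference peak (false negative)
    return true_positives

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def precision_recall_f1(predicted, reference, tol):
    tp = match_peaks(predicted, reference, tol)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical peak indices; tol=5 samples is an illustrative tolerance.
reference = [100, 200, 300, 400]
predicted = [102, 150, 199, 405]
print(precision_recall_f1(predicted, reference, tol=5))

# The F1 values reported above are consistent with the stated precision/recall.
print(round(f1_score(0.8210, 0.8091), 4), round(f1_score(0.9678, 0.9725), 4))
```

In the toy example three of the four predictions fall within tolerance of a reference peak, so precision, recall, and F1 are all 0.75.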

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/scaling/scaling_hr.png)\(a\) HR MAE
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/scaling/scaling_HRV.png)\(b\) HRV MAE
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/scaling/scalling_score.png)\(c\) F1, Precision, and Recall

Figure 15\. The impact of model scaling on Peak\-Detector performance\. As model size increases from 0\.5B to 7B parameters, we observe consistent improvements in \(a\) HR MAE, \(b\) HRV MAE, and \(c\) F1/precision/recall under the same protocol\. This demonstrates meaningful headroom and enables a practical compute–accuracy trade\-off while retaining the same Peak\-Detector interface \(peak outputs \+ rationales\)\.

### 5\.4\.Computational Complexity Analysis

Table 2\. Computational complexity and throughput analysis\. Throughput is measured in segments per second \(SPS\)\. The analysis highlights the trade\-off: Peak\-Detector is slower than lightweight numerical algorithms but roughly 50× faster than general\-purpose LLMs and human interpretation\.

Building on the scaling findings, a comprehensive complexity analysis was conducted to evaluate the computational viability of our model within expert review workflows\. Table [2](https://arxiv.org/html/2605.16452#S5.T2) contrasts Peak\-Detector \(∼3B parameters\) against a spectrum of baselines by quantifying throughput in signal segments per second \(SPS\)\. The results reveal a distinct performance hierarchy: lightweight numerical algorithms, such as Pan\-Tompkins, and specialized deep learning models like CNN\-SWT achieve extremely high throughput \(\>1,000 SPS\) by prioritizing processing speed over semantic depth\. In contrast, general\-purpose proprietary LLMs prove prohibitively slow \(≈0\.07 SPS\) for large\-scale clinical analysis\. Peak\-Detector occupies a critical middle ground; although its computational overhead results in a lower throughput \(3\.571 SPS\) than pure signal processing methods, the Peak Representation strategy allows it to operate over 50× faster than general\-purpose LLMs\. Furthermore, it outperforms the human benchmark of 0\.074 SPS \(Winters et al\., [2022](https://arxiv.org/html/2605.16452#bib.bib89)\) by approximately 48×\. While this model scale may limit real\-time edge processing, the system remains highly efficient for asynchronous tasks, such as automated reporting and retrospective analysis, providing a unique combination of accuracy and explainability that lighter models cannot match\.
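The speed-up factors quoted above follow directly from the throughput figures. The short sketch below reproduces the arithmetic and converts throughput into wall-clock time for a batch; the batch size of 10,000 segments is a hypothetical workload chosen only for illustration.

```python
# Throughput figures quoted in the text, in segments per second (SPS).
PEAK_DETECTOR_SPS = 3.571
HUMAN_SPS = 0.074
GENERAL_LLM_SPS = 0.07  # approximate value reported for general-purpose LLMs

speedup_vs_human = PEAK_DETECTOR_SPS / HUMAN_SPS
speedup_vs_llm = PEAK_DETECTOR_SPS / GENERAL_LLM_SPS

def hours_for(n_segments, sps):
    """Wall-clock hours needed to process n_segments at a given throughput."""
    return n_segments / sps / 3600.0

batch = 10_000  # hypothetical number of segments in a retrospective review
print(f"speed-up vs human:       {speedup_vs_human:.1f}x")
print(f"speed-up vs general LLM: {speedup_vs_llm:.1f}x")
print(f"Peak-Detector: {hours_for(batch, PEAK_DETECTOR_SPS):.2f} h")
print(f"Human review:  {hours_for(batch, HUMAN_SPS):.2f} h")
```

Under these figures a batch that would occupy a human reviewer for well over a day completes in under an hour, which is consistent with the asynchronous use cases named above.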

## 6\.DISCUSSIONS AND LIMITATIONS

### 6\.1\.Real\-world Impact for Cardiovascular Metrics in Ubiquitous Sensing

Peak detection is not an end goal in mobile and ubiquitous health; it is a prerequisite for reliable cardiovascular metrics such as IBI/HR/HRV and downstream inferences \(e\.g\., stress monitoring, sleep assessment, arrhythmia screening, and longitudinal health tracking\) as shown in Appendix [N](https://arxiv.org/html/2605.16452#A14)\. In real\-world deployments, the primary difficulty is not only average accuracy on curated benchmarks, but *robust, auditable operation under non\-ideal conditions*: heterogeneous sensors and modalities, motion artifacts and context\-dependent noise, variable contact quality, intermittent missing data, and user/context shifts over time\. Improved statistics are informative, but a complete picture for field use requires understanding how the system supports trustworthy metric extraction and practical deployment choices\.

Peak\-Detector advances field use through three system\-relevant capabilities that directly support ubiquitous cardiovascular metric extraction\. First, modality\-agnostic generalization reduces per\-device calibration and engineering overhead when integrating new wearable/contactless sensors \(ECG/PPG/BCG/BSG\) into ubiquitous pipelines\. Second, explainable, audit\-ready outputs provide structured rationales that enable rapid expert verification and error triage when signals degrade \(e\.g\., motion/artifacts or arrhythmia\), addressing silent failure modes that can otherwise propagate into downstream HR/HRV estimates\. Third, its compute–accuracy trade\-off suggests a tiered deployment model in which lightweight front\-end methods perform continuous low\-power screening, while Peak\-Detector is used selectively for flagged windows, uncertain segments, noisy intervals, or retrospective summaries that require higher\-fidelity peak localization and interpretable justification\.

Under this interpretation, we do not view Peak\-Detector as a strict real\-time, always\-on model for edge execution on battery\-constrained wearables\. Rather, we consider its most appropriate role to be that of a selectively invoked second\-stage analyzer within broader ubiquitous cardiovascular sensing workflows, particularly in settings where interpretability, auditability, and higher\-confidence peak localization are important\. The present results therefore support Peak\-Detector primarily as a system\-oriented algorithmic component for trustworthy cardiovascular metric extraction\. At the same time, long\-duration free\-living validation and integration into fully end\-to\-end ubiquitous health systems remain important directions for future work\.
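The tiered deployment model described in this subsection can be sketched as a simple router. Everything below is a hypothetical illustration: the function names, the quality heuristic, and the threshold are assumptions, and `toy_second_stage` is a stand-in for a call to an explainable detector rather than the actual Peak\-Detector interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WindowResult:
    peaks: List[int]
    rationale: str
    stage: str  # which tier produced the result

def toy_fast_detector(window):
    """Lightweight front end: strict local maxima above the window mean."""
    mean = sum(window) / len(window)
    return [i for i in range(1, len(window) - 1)
            if window[i] > mean
            and window[i] > window[i - 1] and window[i] >= window[i + 1]]

def toy_quality(peaks):
    """Interval regularity in [0, 1]; 1.0 means perfectly even spacing."""
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    if len(gaps) < 2:
        return 0.0
    mean_gap = sum(gaps) / len(gaps)
    return max(0.0, 1.0 - (max(gaps) - min(gaps)) / mean_gap)

def toy_second_stage(window):
    """Placeholder for a slower, explainable second-stage analyzer."""
    return toy_fast_detector(window), "placeholder rationale"

def tiered_detect(window, quality_threshold=0.8):
    """Accept the front-end result when confident, else escalate the window."""
    peaks = toy_fast_detector(window)
    if toy_quality(peaks) >= quality_threshold:
        return WindowResult(peaks, "", "front-end")
    peaks, rationale = toy_second_stage(window)
    return WindowResult(peaks, rationale, "second-stage")

clean = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
noisy = [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(tiered_detect(clean).stage)
print(tiered_detect(noisy).stage)
```

The design point is that the expensive analyzer runs only on windows the front end cannot handle confidently, so its lower throughput applies to a fraction of the data stream.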

### 6\.2\.Generalizability Across Physiological Modalities

The development of a generalizable and explainable peak\-detection framework addresses an important systems challenge in mobile and ubiquitous health\. In many cardiovascular sensing pipelines, peak detection is not the final objective, but a prerequisite for deriving reliable intermediate measures such as inter\-beat interval \(IBI\), heart rate \(HR\), and heart rate variability \(HRV\), which in turn support higher\-level applications including rhythm monitoring, sleep\-related analysis, stress assessment, and longitudinal health tracking\. From this perspective, the practical value of Peak\-Detector lies less in any single downstream task than in its potential to provide a common, interpretable front end for trustworthy cardiovascular metric extraction across heterogeneous sensing modalities\.

This role is particularly relevant because different sensing modalities have traditionally required separate peak\-detection strategies tailored to their own signal morphology and noise characteristics\. ECG, PPG, BCG, and BSG each expose distinct fiducial structures and are often deployed in different sensing contexts, which has historically led to fragmented algorithm design and repeated modality\-specific tuning\. Our results suggest that a modality\-agnostic framework such as Peak\-Detector can reduce this fragmentation by supporting a shared analysis interface across heterogeneous signals, thereby lowering the engineering overhead associated with selecting, adapting, and calibrating separate algorithms for each device or modality\.

Under this interpretation, the contribution of Peak\-Detector is not that it directly solves all downstream clinical or wellness tasks, but that it provides a more unified and auditable foundation for the upstream peak\-analysis step on which such tasks depend\. This property is especially valuable for ubiquitous sensing systems that must incorporate new sensors, operate across multiple modalities, or balance accuracy with interpretability in broader cardiovascular monitoring workflows\.

### 6\.3\.Explainability and Clinical Trust

Many deep learning baselines \(Zuo et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib58); Reiss et al\., [2019](https://arxiv.org/html/2605.16452#bib.bib57); Biswas et al\., [2019](https://arxiv.org/html/2605.16452#bib.bib83); Schranz et al\., [2024](https://arxiv.org/html/2605.16452#bib.bib39)\) for peak detection produce only discrete peak locations \(or a probability map\) without an explicit, human\-readable justification of *why* those peaks are selected\. While post\-hoc interpretation tools \(e\.g\., saliency maps / attention visualizations\) can highlight influential regions, they typically do not provide structured, physiologically grounded reasoning that supports rapid expert auditing\.

In contrast, Peak\-Detector is designed to output not only peak timestamps but also concise rationales aligned with domain criteria\. Concretely, Peak\-Detector summarizes evidence across four complementary dimensions: morphology \(shape consistency\), temporal consistency \(interval constraints\), amplitude dominance \(relative prominence\), and waveform context \(local neighborhood patterns\) as shown in Appendix [O](https://arxiv.org/html/2605.16452#A15)\. These cues align with how practitioners reason about fiducial points, making model outputs easier to audit and validate than predictions that provide no rationale\.

Under this framing, the value of explainability is not merely descriptive, but operational: structured rationales can assist audit, failure analysis, and rapid verification when signal quality degrades or when candidate peaks are ambiguous\. We further support this process through the visualization\-assisted audit interface described in Appendix I, which provides a practical mechanism for human\-in\-the\-loop inspection in the evaluated settings\. Taken together, these design choices position explainability in Peak\-Detector as a tool for improving transparency and trustworthiness in upstream peak analysis, rather than as a post\-hoc add\-on to otherwise opaque predictions\.

### 6\.4\.LLM Hallucination

Large Language Model \(LLM\) hallucination \(Huang et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib73)\) describes the generation of content that remains semantically plausible while being factually ungrounded\. Within the domain of physiological signal analysis, hallucination risks are primarily localized to the Peak\-Explanation text and typically manifest in two forms: \(a\) Objective Inconsistencies, where generated timestamps, amplitudes, or intervals contradict the input metadata; and \(b\) Semantic Inaccuracies, characterized by logically unsupported diagnostic rationale\. To quantify these risks, we conducted a rigorous verification process on the Peak\-Explanation dataset \(see Appendix [K](https://arxiv.org/html/2605.16452#A11)\)\. Our empirical results show zero factual errors and a negligible incidence of ambiguous expressions, suggesting that the rationales in the dataset are largely free of hallucinated content\.
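An objective-consistency check of the kind described in (a) can be automated by comparing numeric mentions in a rationale against the input candidates. The sketch below is an assumption-laden illustration: the rationale phrasing `index <int> (amplitude <float>)`, the function name `check_rationale`, and the tolerance are hypothetical and do not reflect the verification procedure in the appendix.

```python
import re

# Assumed (hypothetical) citation format: "index 120 (amplitude 0.91)".
MENTION = re.compile(r"index\s+(\d+)\s*\(amplitude\s+(-?\d+(?:\.\d+)?)\)")

def check_rationale(rationale, candidates, amp_tol=1e-3):
    """List mentions in `rationale` that contradict the input `candidates`.

    `candidates` maps peak index -> amplitude as supplied to the model.
    Returns (index, reason) tuples; an empty list means no contradiction found.
    """
    problems = []
    for index_text, amp_text in MENTION.findall(rationale):
        index, amplitude = int(index_text), float(amp_text)
        if index not in candidates:
            problems.append((index, "index not in input"))
        elif abs(candidates[index] - amplitude) > amp_tol:
            problems.append((index, "amplitude mismatch"))
    return problems

candidates = {120: 0.91, 305: 0.88}
grounded = "Peak at index 120 (amplitude 0.91) dominates its neighborhood."
ungrounded = ("Peak at index 120 (amplitude 0.91) is followed by "
              "index 310 (amplitude 0.88) at a regular interval.")
print(check_rationale(grounded, candidates))
print(check_rationale(ungrounded, candidates))
```

A check like this only covers objective inconsistencies; semantic inaccuracies of type (b) still require sampling and human review.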

### 6\.5\.The Signal\-to\-Sequence Paradigm

The success of our approach is fundamentally rooted in the Peak Representation, which serves as a powerful abstraction layer between raw numerical data and the LLM\. While this transformation is intuitive for peak detection, where salient information is locally concentrated, its implications extend far beyond this specific task\. This signal\-to\-sequence paradigm represents a transformative approach for a wide range of time\-series analysis tasks where information is sparse or event\-driven, including anomaly detection, EEG seizure detection, and sleep stage classification\. By converting irregular time\-series data into a symbolic, token\-based format, this method naturally aligns with the architectural strengths of LLMs, potentially unlocking their sophisticated reasoning capabilities for problems previously dominated by specialized numerical models\. Furthermore, this strategy effectively distills critical temporal information, offering a robust alternative to traditional frequency\-domain analyses and mitigating the impact of missing data in volatile time\-series environments \(Chen et al\., [2023a](https://arxiv.org/html/2605.16452#bib.bib86); Zhang et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib87)\)\. From a ubiquitous computing perspective, this event\-centric representation is well matched to resource\-constrained, event\-driven sensing pipelines, where only sparse salient events need to be transmitted, reviewed, or fused across devices\.

### 6\.6\.Limitations and Future Directions

Real\-time and on\-device constraints\. While Peak\-Detector\-3B is practical for offline/batch analysis on workstation/server hardware \(Sec\. 5\.4; Table 2\), it remains slower than lightweight CNN baselines \(e\.g\., FR\-Net\) and may be unsuitable for strict real\-time, always\-on, on\-device processing in unconstrained field settings\. Appendices H and K quantify the memory–accuracy and device feasibility trade\-offs; developing more resource\-aware variants that preserve explanation quality remains an important direction\. Our intended use case is therefore not strict always\-on edge execution, but selective second\-stage analysis for windows that have already been surfaced by lightweight front\-end detectors or downstream review pipelines\.

Scope of target peaks\. This study focuses on the dominant fiducial point per modality \(ECG R, PPG systolic, BCG/BSG J\), which is sufficient for the primary metrics emphasized here \(IBI/HR/HRV\)\. Extending the framework to multi\-fiducial detection \(e\.g\., ECG P/T waves\) requires datasets with consistent, high\-quality multi\-wave annotations and is left for future work\.

Generalization beyond the benchmark\. Although we evaluate across multiple datasets and report cross\-dataset analyses \(Appendix J\), fully in\-the\-wild deployments introduce additional sources of domain shift \(device diversity, long\-term drift, context transitions\)\. Future work will include longitudinal field evaluations and end\-to\-end integration into ubiquitous health workflows\. Although our robustness analyses under controlled noise in Appendix [E](https://arxiv.org/html/2605.16452#A5) provide supporting evidence, we do not claim that they substitute for fully free\-living validation, where motion artifacts, posture transitions, contact variability, and long\-term domain drift remain open challenges\.

Explanation quality at scale\. We provide verification analyses for teacher\-generated rationales \(Appendix I\)\. While objective consistency checks are applied at scale, semantic quality is assessed via sampling due to the cost of exhaustive manual review\. Improving automatic semantic validation and studying how explanations impact expert confidence and decision\-making in real workflows are promising directions\.

## 7\.CONCLUSION

Peak\-Detector presents a specialized framework that reformulates cardiac peak detection as a language\-guided reasoning task\. By leveraging Large Language Models, it moves beyond rigid handcrafted heuristics and opaque deep learning pipelines toward a more transparent and auditable approach to peak analysis\. The framework combines a compact Peak Representation, a two\-stage instruction\-tuning strategy, and a dedicated Peak\-Explanation Dataset to jointly support accurate peak localization and structured self\-explanations\. Across seven benchmarks spanning ECG, PPG, BCG, and BSG, the results show that Peak\-Detector achieves strong cross\-modal performance while offering interpretable rationales that can support verification and error analysis\.

Rather than positioning Peak\-Detector as a strict real\-time, always\-on model for battery\-constrained wearable deployment, we view its current strength as a selectively invoked, higher\-fidelity second\-stage analyzer within broader ubiquitous cardiovascular sensing workflows\. In this role, Peak\-Detector is particularly well suited for flagged windows, uncertain segments, and retrospective summaries where explainability, auditability, and higher\-confidence peak localization are especially important\. Overall, this work provides a system\-oriented foundation for trustworthy cardiovascular metric extraction across heterogeneous sensing modalities, while long\-duration free\-living validation and more resource\-aware edge deployment remain important directions for future work\.

## 8\.ACKNOWLEDGMENT

Portions of the text in this manuscript were refined using OpenAI ChatGPT to improve clarity, phrasing, and grammar\. The tool was not used to generate ideas, analyses, figures, tables, or any scientific content\. The authors take full responsibility for the integrity and accuracy of all content in this manuscript\.

## References

- A\. Alivar, C\. Carlson, A\. Suliman, S\. Warren, P\. Prakash, D\. E\. Thompson, and B\. Natarajan \(2019\) Motion artifact detection and reduction in bed\-based ballistocardiogram\. IEEE Access 7, pp\. 13693–13703\. Cited by: [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- M\. L\. Arbuzov, A\. A\. Shvets, and S\. Beir \(2025\) Beyond exponential decay: rethinking error accumulation in large language models\. arXiv preprint arXiv:2505\.24187\. Cited by: [§2\.2](https://arxiv.org/html/2605.16452#S2.SS2.p2.1)\.
- S\. Azhaginiyan, M\. Anish, M\. K\. Shivaranjan, and M\. Ganesan \(2019\) Denoising of bcg signal using multi resolution analysis\. In 2019 5th International conference on advanced computing & communication systems \(ICACCS\), pp\. 1005–1008\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- D\. Baehrens, T\. Schroeter, S\. Harmeling, M\. Kawanabe, K\. Hansen, and K\. Müller \(2010\) How to explain individual classification decisions\. The Journal of Machine Learning Research 11, pp\. 1803–1831\. Cited by: [§2\.3](https://arxiv.org/html/2605.16452#S2.SS3.p1.1)\.
- S\. Banerjee, R\. Gupta, and M\. Mitra \(2012\) Delineation of ecg characteristic features using multiresolution wavelet analysis method\. Measurement 45\(3\), pp\. 474–487\. Cited by: [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- X\. Bi, D\. Chen, G\. Chen, S\. Chen, D\. Dai, C\. Deng, H\. Ding, K\. Dong, Q\. Du, Z\. Fu, et al\. \(2024\) Deepseek llm: scaling open\-source language models with longtermism\. arXiv preprint arXiv:2401\.02954\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p4.1)\.
- S\. M\. Bishop and A\. Ercole \(2018\) Multi\-scale peak and trough detection optimised for periodic and quasi\-periodic neuroscience data\. In Intracranial Pressure & Neuromonitoring XVI, pp\. 189–195\. Cited by: [4th item](https://arxiv.org/html/2605.16452#A2.I1.i4.p1.1), [§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- D\. Biswas, L\. Everson, M\. Liu, M\. Panwar, B\. Verhoef, S\. Patki, C\. H\. Kim, A\. Acharyya, C\. Van Hoof, M\. Konijnenburg, et al\. \(2019\) CorNET: deep learning framework for ppg\-based heart rate estimation and biometric identification in ambulant environment\. IEEE transactions on biomedical circuits and systems 13\(2\), pp\. 282–291\. Cited by: [§6\.3](https://arxiv.org/html/2605.16452#S6.SS3.p1.1)\.
- M\. Blackstein \(2025\) RTX 5090: designed for the age of neural rendering\. In 2025 IEEE Hot Chips 37 Symposium \(HCS\), pp\. 1–20\. Cited by: [Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.4.2.1)\.
- C\. Brüser, S\. Winter, and S\. Leonhardt \(2013\) Robust inter\-beat interval estimation in cardiac vibration signals\. Physiological measurement 34\(2\), pp\. 123\. Cited by: [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- C\. Carlson, V\. Turpin, A\. Suliman, C\. Ade, S\. Warren, and D\. E\. Thompson \(2020\) Bed\-based ballistocardiography: dataset and ability to track cardiovascular parameters\. Sensors 21\(1\), pp\. 156\. Cited by: [5th item](https://arxiv.org/html/2605.16452#A1.I1.i5.p1.1), [§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- D\. Castaneda, A\. Esparza, M\. Ghamari, C\. Soltanpur, and H\. Nazeran \(2018\) A review on wearable photoplethysmography sensors and their potential future applications in health care\. International journal of biosensors & bioelectronics 4\(4\), pp\. 195\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- D\. Castelvecchi \(2016\) Can we open the black box of ai? Nature News 538\(7623\), pp\. 20\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p3.1)\.
- A\. Chakraborty, D\. Sadhukhan, and M\. Mitra \(2020\) A robust ppg onset and systolic peak detection algorithm based on hilbert transform\. In 2020 IEEE Calcutta Conference \(CALCON\), pp\. 176–180\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p3.1)\.
- T\. Chen and C\. Guestrin \(2016\) Xgboost: a scalable tree boosting system\. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp\. 785–794\. Cited by: [§5\.1](https://arxiv.org/html/2605.16452#S5.SS1.p3.1)\.
- Y\. Chen, K\. Ren, Y\. Wang, Y\. Fang, W\. Sun, and D\. Li \(2023a\) Contiformer: continuous\-time transformer for irregular time series modeling\. Advances in Neural Information Processing Systems 36, pp\. 47143–47175\. Cited by: [§6\.5](https://arxiv.org/html/2605.16452#S6.SS5.p1.1)\.
- Z\. Chen, M\. Wang, M\. Zhang, W\. Huang, H\. Gu, and J\. Xu \(2023b\) Post\-processing refined ecg delineation based on 1d\-unet\. Biomedical Signal Processing and Control 79, pp\. 104106\. Cited by: [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p3.1)\.
- Z\. Chen, K\. Zheng, J\. Shen, Y\. Lin, Y\. Feng, and J\. Xu \(2023c\) Sample point classification of abdominal ecg through cnn\-transformer model enables efficient fetal heart rate detection\. IEEE Transactions on Instrumentation and Measurement 73, pp\. 1–12\. Cited by: [3rd item](https://arxiv.org/html/2605.16452#A2.I2.i3.p1.1), [§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p3.1)\.
- B\. H\. Choi, G\. S\. Chung, J\. Lee, D\. Jeong, and K\. S\. Park \(2009\) Slow\-wave sleep estimation on a load\-cell\-installed bed: a non\-constrained method\. Physiological measurement 30\(11\), pp\. 1163\. Cited by: [6th item](https://arxiv.org/html/2605.16452#A2.I1.i6.p1.1), [§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- H\. Dong, J\. Jiang, R\. Lu, J\. Luo, J\. Song, B\. Li, Y\. Shen, and Z\. Wang \(2025\) Beyond a single ai cluster: a survey of decentralized llm training\. arXiv preprint arXiv:2503\.11023\. Cited by: [Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.5.3.1)\.
- M\. Elgendi, I\. Norton, M\. Brearley, D\. Abbott, and D\. Schuurmans \(2013\) Systolic peak detection in acceleration photoplethysmograms measured from emergency responders in tropical conditions\. PloS one 8\(10\), pp\. e76585\. Cited by: [3rd item](https://arxiv.org/html/2605.16452#A1.I1.i3.p1.1.2), [§1](https://arxiv.org/html/2605.16452#S1.p3.1), [§4\.2](https://arxiv.org/html/2605.16452#S4.SS2.p1.2)\.
- M\. Elgendi \(2012\) On the analysis of fingertip photoplethysmogram signals\. Current cardiology reviews 8\(1\), pp\. 14–25\. Cited by: [3rd item](https://arxiv.org/html/2605.16452#A2.I1.i3.p1.1), [§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- M\. Fariha, R\. Ikeura, S\. Hayakawa, and S\. Tsutsumi \(2020\) Analysis of pan\-tompkins algorithm performance with noisy ecg signals\. In Journal of Physics: Conference Series, Vol\. 1532, pp\. 012022\. Cited by: [1st item](https://arxiv.org/html/2605.16452#A2.I1.i1.p1.1), [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1), [§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- E\. Fons, R\. Kaur, S\. Palande, Z\. Zeng, T\. Balch, M\. Veloso, and S\. Vyetrenko \(2024\) Evaluating large language models on time series feature understanding: a comprehensive taxonomy and benchmark\. arXiv preprint arXiv:2404\.16563\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p4.1), [§2\.2](https://arxiv.org/html/2605.16452#S2.SS2.p2.1), [§3\.2](https://arxiv.org/html/2605.16452#S3.SS2.p3.1)\.
- P\. Fung, G\. Dumont, C\. Ries, C\. Mott, and M\. Ansermino \(2004\) Continuous noninvasive blood pressure measurement by pulse transit time\. In The 26th annual international conference of the IEEE engineering in medicine and biology society, Vol\. 1, pp\. 738–741\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- E\. Georganas, D\. Kalamkar, and A\. Heinecke \(2025\) Pushing the envelope of llm inference on ai\-pc\. arXiv preprint arXiv:2508\.06753\. Cited by: [Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.4.2.1)\.
- J\. W\. Gordon \(1877\) Certain molar movements of the human body produced by the circulation of the blood\. Journal of anatomy and physiology 11\(Pt 3\), pp\. 533\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2025\) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\. ACM Transactions on Information Systems 43\(2\), pp\. 1–55\. External Links: ISSN 1558\-2868, [Link](http://dx.doi.org/10.1145/3703155), [Document](https://dx.doi.org/10.1145/3703155)\. Cited by: [Appendix K](https://arxiv.org/html/2605.16452#A11.p1.1), [§6\.4](https://arxiv.org/html/2605.16452#S6.SS4.p1.1)\.
- W\. Karlen, S\. Raman, J\. M\. Ansermino, and G\. A\. Dumont \(2013\) Multiparameter respiratory rate estimation from the photoplethysmogram\. IEEE Transactions on biomedical engineering 60\(7\), pp\. 1946–1953\. Cited by: [4th item](https://arxiv.org/html/2605.16452#A1.I1.i4.p1.1), [§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- K\. Kazemi, J\. Laitala, I\. Azimi, P\. Liljeberg, and A\. M\. Rahmani \(2022\) Robust ppg peak detection using dilated convolutional neural networks\. Sensors 22\(16\), pp\. 6054\. Cited by: [§1](https://arxiv.org/html/2605.16452#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p3.1)\.
- D\. J\. Kelley, T\. R\. Oakes, L\. L\. Greischar, M\. K\. Chung, J\. M\. Ollinger, A\. L\. Alexander, S\. E\. Shelton, N\. H\. Kalin, and R\. J\. Davidson \(2008\) Automatic physiological waveform processing for fmri noise correction and analysis\. PloS one 3\(3\), pp\. e1751\. Cited by: [§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- C\. Kim, S\. L\. Ober, M\. S\. McMurtry, B\. A\. Finegan, O\. T\. Inan, R\. Mukkamala, and J\. Hahn \(2016\)Ballistocardiogram: mechanism and potential for unobtrusive cardiovascular health monitoring\.Scientific reports6\(1\),pp\. 31297\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p1.1)\.
- S\. Kuntamalla and L\. R\. G\. Reddy \(2014\)An efficient and automatic systolic peak detection algorithm for photoplethysmographic signals\.International Journal of Computer Applications97\(19\)\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p3.1)\.
- P\. Langer, T\. Kaar, M\. Rosenblattl, M\. A\. Xu, W\. Chow, M\. Maritsch, A\. Verma, B\. Han, D\. S\. Kim, H\. Chubb,et al\.\(2025\)OpenTSLM: time\-series language models for reasoning over multivariate medical text\-and time\-series data\.arXiv preprint arXiv:2510\.02410\.Cited by:[§2\.2](https://arxiv.org/html/2605.16452#S2.SS2.p1.1)\.
- M\. P\. LaValley \(2008\)Logistic regression\.Circulation117\(18\),pp\. 2395–2399\.Cited by:[§5\.1](https://arxiv.org/html/2605.16452#S5.SS1.p3.1)\.
- T\. Lerddararadsamee and Y\. Jiraraksopakun \(2012\)Local maximum detection for fully automatic classification of em algorithm\.In2012 9th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology,pp\. 1–4\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- B\. N\. Li, M\. C\. Dong, and M\. I\. Vai \(2010\)On an automatic delineator for arterial blood pressure waveforms\.Biomedical Signal Processing and Control5\(1\),pp\. 76–81\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- E\. Li, A\. B\. L\. Larsen, C\. Zhang, X\. Zhou, J\. Qin, D\. A\. Yap, N\. Raghavan, X\. Chang, M\. Bowler, E\. Yildiz,et al\.\(2025\)Apple intelligence foundation language models: tech report 2025\.arXiv preprint arXiv:2507\.13575\.Cited by:[Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.3.1.1),[Appendix M](https://arxiv.org/html/2605.16452#A13.p1.1)\.
- Y\. Li, J\. Huang, X\. Yao, S\. Mu, S\. Zong, and Y\. Shen \(2024\)A ballistocardiogram dataset with reference sensor signals in long\-term natural sleep environments\.Scientific Data11\(1\),pp\. 1091\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\)A unified approach to interpreting model predictions\.Advances in neural information processing systems30\.Cited by:[§2\.3](https://arxiv.org/html/2605.16452#S2.SS3.p1.1)\.
- Y\. Makin and R\. Maliakkal \(2025\)Sustainable ai training via hardware–software co\-design on nvidia, amd, and emerging gpu architectures\.In2025 IEEE International Conference on Service\-Oriented System Engineering \(SOSE\),pp\. 204–215\.Cited by:[Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.5.3.1)\.
- D\. Makowski, T\. Pham, Z\. J\. Lau, J\. C\. Brammer, F\. Lespinasse, H\. Pham, C\. Schölzel, and S\. A\. Chen \(2021\)NeuroKit2: a python toolbox for neurophysiological signal processing\.Behavior research methods53\(4\),pp\. 1689–1696\.Cited by:[§B\.1](https://arxiv.org/html/2605.16452#A2.SS1.p1.1)\.
- G\. B\. Moody and R\. G\. Mark \(1992\)MIT\-BIH arrhythmia database\.Cited by:[1st item](https://arxiv.org/html/2605.16452#A1.I1.i1.p1.1),[§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- M\. Nabian, Y\. Yin, J\. Wormwood, K\. S\. Quigley, L\. F\. Barrett, and S\. Ostadabbas \(2018\)An open\-source feature extraction tool for the analysis of peripheral physiological data\.IEEE journal of translational engineering in health and medicine6,pp\. 1–11\.Cited by:[7th item](https://arxiv.org/html/2605.16452#A1.I1.i7.p1.1.2),[2nd item](https://arxiv.org/html/2605.16452#A2.I1.i2.p1.1),[§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- T\. Nguyen et al\.\(2025\)An evaluation of llms inference on popular single\-board computers\.arXiv preprint arXiv:2511\.07425\.Cited by:[Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.1.2),[Appendix M](https://arxiv.org/html/2605.16452#A13.p1.1)\.
- T\. Penzel, J\. W\. Kantelhardt, R\. P\. Bartsch, M\. Riedl, J\. F\. Kraemer, N\. Wessel, C\. Garcia, M\. Glos, I\. Fietze, and C\. Schöbel \(2016\)Modulations of heart rate, ecg, and cardio\-respiratory coupling observed in polysomnography\.Frontiers in physiology7,pp\. 460\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p1.1)\.
- M\. A\. Pimentel, A\. E\. Johnson, P\. H\. Charlton, D\. Birrenkott, P\. J\. Watkinson, L\. Tarassenko, and D\. A\. Clifton \(2016\)Toward a robust estimation of respiratory rate from pulse oximeters\.IEEE Transactions on Biomedical Engineering64\(8\),pp\. 1914–1923\.Cited by:[3rd item](https://arxiv.org/html/2605.16452#A1.I1.i3.p1.1),[§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- E\. J\. Pino, J\. A\. Chávez, and P\. Aqueveque \(2017\)BCG algorithm for unobtrusive heart rate monitoring\.In2017 IEEE healthcare innovations and point of care technologies \(HI\-poct\),pp\. 180–183\.Cited by:[5th item](https://arxiv.org/html/2605.16452#A2.I1.i5.p1.1),[§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p2.1)\.
- Z\. F\. Pitafi, Y\. Song, Z\. Xie, B\. Brainard, and W\. Song \(2025\)Contactless vital signs monitoring for animals\.IEEE Internet of Things Journal\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- A\. Reiss, I\. Indlekofer, P\. Schmidt, and K\. Van Laerhoven \(2019\)Deep ppg: large\-scale heart rate estimation with convolutional neural networks\.Sensors19\(14\),pp\. 3079\.Cited by:[§4\.2](https://arxiv.org/html/2605.16452#S4.SS2.p1.2),[§6\.3](https://arxiv.org/html/2605.16452#S6.SS3.p1.1)\.
- S\. J\. Rigatti \(2017\)Random forest\.Journal of insurance medicine47\(1\),pp\. 31–39\.Cited by:[§5\.1](https://arxiv.org/html/2605.16452#S5.SS1.p3.1)\.
- P\. Sarkar and A\. Etemad \(2021\)Cardiogan: attentive generative adversarial network with dual discriminators for synthesis of ecg from ppg\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 488–496\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p3.1)\.
- L\. Sathyapriya, L\. Murali, and T\. Manigandan \(2014\)Analysis and detection r\-peak detection using modified pan\-tompkins algorithm\.In2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies,pp\. 483–487\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- C\. Schranz, C\. Halmich, S\. Mayr, and D\. P\. Heib \(2024\)Surrogate modelling of heartbeat events for improved j\-peak detection in bcg using deep learning\.Frontiers in Network Physiology4,pp\. 1425871\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p3.1),[§6\.3](https://arxiv.org/html/2605.16452#S6.SS3.p1.1)\.
- S\. Setiawidayat and A\. Y\. Rahman \(2018\)New method for obtaining peak value r and the duration of each cycle of electrocardiogram\.In2018 International Conference on Sustainable Information Engineering and Technology \(SIET\),pp\. 77–81\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§3\.4\.2](https://arxiv.org/html/2605.16452#S3.SS4.SSS2.p1.1)\.
- J\. Shin, B\. Choi, Y\. G\. Lim, D\. Jeong, and K\. Park \(2008\)Automatic ballistocardiogram \(bcg\) beat detection using a template matching approach\.In2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society,pp\. 1144–1146\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- Y\. Song, B\. Li, D\. Luo, G\. S\. B\. Glasgow, B\. G\. Phillips, Y\. Ke, and W\. Song \(2024\)Real\-time continuous blood pressure estimation with contact\-free bedseismogram\.InICC 2024\-IEEE International Conference on Communications,pp\. 214–219\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p1.1),[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- Y\. Song, B\. Li, D\. Luo, Z\. Xie, B\. G\. Phillips, Y\. Ke, and W\. Song \(2023\)Engagement\-free and contactless bed occupancy and vital signs monitoring\.IEEE internet of things journal11\(5\),pp\. 7935–7947\.Cited by:[7th item](https://arxiv.org/html/2605.16452#A1.I1.i7.p1.1),[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- Y\. Song, H\. Xiang, Z\. Zeng, J\. Chen, Y\. Zhang, Z\. F\. Pitafi, H\. Yang, Q\. Lu, X\. Zhang, B\. G\. Phillips,et al\.\(2025\)Multi\-granularity supervised contrastive learning with online adaptation for contactless in\-bed posture classification\.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies9\(2\),pp\. 1–32\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- J\. Su, X\. Zhu, X\. Zhang, J\. Tang, and L\. Liu \(2009\)Ballistocardiogram measurement system using three load\-cell sensors platform in chair\.In2009 2nd International Conference on Biomedical Engineering and Informatics,pp\. 1–4\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- S\. Sun \(2011\)Meta\-analysis of cohen’s kappa\.Health Services and Outcomes Research Methodology11\(3\),pp\. 145–163\.Cited by:[Appendix A](https://arxiv.org/html/2605.16452#A1.p4.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[Table 12](https://arxiv.org/html/2605.16452#A13.T12.1.1.3.1.1),[Appendix M](https://arxiv.org/html/2605.16452#A13.p1.1)\.
- A\. Temko \(2017\)Accurate heart rate monitoring during physical exercises using ppg\.IEEE Transactions on Biomedical Engineering64\(9\),pp\. 2016–2024\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p1.1)\.
- V\. Tihonenko, A\. Khaustov, S\. Ivanov, and A\. Rivin \(2007\)St\.\-petersburg institute of cardiological technics 12\-lead arrhythmia database\.Cited by:[2nd item](https://arxiv.org/html/2605.16452#A1.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- E\. Tjoa and C\. Guan \(2020\)A survey on explainable artificial intelligence \(xai\): toward medical xai\.IEEE transactions on neural networks and learning systems32\(11\),pp\. 4793–4813\.Cited by:[§2\.3](https://arxiv.org/html/2605.16452#S2.SS3.p1.1)\.
- S\. Vijayarangan, R\. Vignesh, B\. Murugesan, S\. Preejith, J\. Joseph, and M\. Sivaprakasam \(2020\)RPnet: a deep learning approach for robust r peak detection in noisy ecg\.In2020 42nd annual international conference of the IEEE engineering in medicine & biology society \(EMBC\),pp\. 345–348\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p3.1)\.
- P\. Virtanen, R\. Gommers, T\. E\. Oliphant, M\. Haberland, T\. Reddy, D\. Cournapeau, E\. Burovski, P\. Peterson, W\. Weckesser, J\. Bright,et al\.\(2020\)SciPy 1\.0: fundamental algorithms for scientific computing in python\.Nature methods17\(3\),pp\. 261–272\.Cited by:[§3\.2](https://arxiv.org/html/2605.16452#S3.SS2.p2.5),[§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p3.1)\.
- Z\. Wang, G\. Xu, and M\. Ren \(2024\)Llm\-generated natural language meets scaling laws: new explorations and data augmentation methods\.arXiv preprint arXiv:2407\.00322\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p4.1)\.
- G\. J\. Warmerdam, R\. Vullings, L\. Schmitt, J\. O\. Van Laar, and J\. W\. Bergmans \(2018\)Hierarchical probabilistic framework for fetal r\-peak detection, using ecg waveform and heart rate information\.IEEE Transactions on Signal Processing66\(16\),pp\. 4388–4397\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p2.1)\.
- R\. M\. West \(2021\)Best practice in statistics: use the welch t\-test when testing the difference between two groups\.Annals of clinical biochemistry58\(4\),pp\. 267–269\.Cited by:[§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p3.2)\.
- L\. J\. Winters, D\. A\. Till, M\. L\. Bing, and J\. F\. Holmes \(2022\)Time required for electrocardiogram interpretation in the emergency department\.Academic Emergency Medicine29\(5\)\.Cited by:[§5\.4](https://arxiv.org/html/2605.16452#S5.SS4.p1.5),[Table 2](https://arxiv.org/html/2605.16452#S5.T2.1.1.19.18.4)\.
- X\. Wu, Z\. Wang, B\. Xu, and X\. Ma \(2020\)Optimized pan\-tompkins based heartbeat detection algorithms\.In2020 Chinese Control And Decision Conference \(CCDC\),pp\. 892–897\.Cited by:[§2\.1](https://arxiv.org/html/2605.16452#S2.SS1.p2.1)\.
- H\. Xu and K\. M\. J\. Shuttleworth \(2024\)Medical artificial intelligence and the black box problem: a view based on the ethical principle of “do no harm”\.Intelligent Medicine4\(1\),pp\. 52–57\.Cited by:[§1](https://arxiv.org/html/2605.16452#S1.p3.1)\.
- K\. Yang, M\. Hong, J\. Zhang, Y\. Luo, S\. Zhao, O\. Zhang, X\. Yu, J\. Zhou, L\. Yang, P\. Zhang,et al\.\(2025\)ECG\-lm: understanding electrocardiogram with a large language model\.Health Data Science5,pp\. 0221\.Cited by:[§2\.2](https://arxiv.org/html/2605.16452#S2.SS2.p1.1)\.
- D\. Yun, H\. Lee, C\. Jung, S\. Kwon, S\. Lee, K\. Kim, Y\. S\. Kim, and S\. S\. Han \(2022\)Robust r\-peak detection in an electrocardiogram with stationary wavelet transformation and separable convolution\.Scientific Reports12\(1\),pp\. 19638\.Cited by:[1st item](https://arxiv.org/html/2605.16452#A2.I2.i1.p1.1),[§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p3.1)\.
- M\. D\. Zeiler and R\. Fergus \(2014\)Visualizing and understanding convolutional networks\.InEuropean conference on computer vision,pp\. 818–833\.Cited by:[§2\.3](https://arxiv.org/html/2605.16452#S2.SS3.p1.1)\.
- J\. Zhan, Z\. Li, X\. Wu, C\. Zhang, T\. Zhao, K\. Chen, and Z\. Lu \(2025\)A multi\-pathology ballistocardiogram dataset for cardiac function monitoring and arrhythmia assessment\.Scientific Data12\(1\),pp\. 963\.Cited by:[6th item](https://arxiv.org/html/2605.16452#A1.I1.i6.p1.1),[§4\.1](https://arxiv.org/html/2605.16452#S4.SS1.p1.1)\.
- Y\. Zhang, X\. Wang, X\. Yu, Z\. Zhou, X\. Xu, L\. Bai, and Y\. Wang \(2025\)DIFFODE: neural ode with differentiable hidden state for irregular time series analysis\.In2025 IEEE 41st International Conference on Data Engineering \(ICDE\),pp\. 1–14\.Cited by:[§6\.5](https://arxiv.org/html/2605.16452#S6.SS5.p1.1)\.
- T\. Zhou, S\. Men, J\. Liang, B\. Yu, H\. Zhang, and X\. Luo \(2021\)1D u\-net\+\+: an effective method for ballistocardiogram j\-peak detection\.Journal of Mechanics in Medicine and Biology21\(10\),pp\. 2140058\.Cited by:[2nd item](https://arxiv.org/html/2605.16452#A2.I2.i2.p1.1),[§4\.3](https://arxiv.org/html/2605.16452#S4.SS3.p3.1)\.
- C\. Zuo, Y\. Zhao, and J\. Ye \(2025\)TAU: modeling temporal consistency through temporal attentive u\-net for ppg peak detection\.arXiv preprint arXiv:2503\.10733\.Cited by:[§4\.2](https://arxiv.org/html/2605.16452#S4.SS2.p1.2),[§6\.3](https://arxiv.org/html/2605.16452#S6.SS3.p1.1)\.

## Appendix A DATASET DETAILS

To thoroughly evaluate the cross\-modal and explanatory capabilities of Peak\-Detector, we conducted experiments on six publicly available physiological signal datasets and one real\-world clinical cohort\. The public collection includes two datasets each for Electrocardiogram \(ECG\), Photoplethysmogram \(PPG\), and Ballistocardiogram \(BCG\) signals, and the clinical cohort contributes Bodyseismography \(BSG\) recordings, ensuring broad coverage of signal modalities and varying data characteristics\. A summary of these datasets is provided below:

- MIT\-BIH Arrhythmia Database \(Moody and Mark, [1992](https://arxiv.org/html/2605.16452#bib.bib51)\): This seminal dataset comprises 48 half\-hour, two\-channel ambulatory ECG recordings, acquired from 47 subjects between 1975 and 1979 by the BIH Arrhythmia Laboratory\. ECG signals are sampled at 360 Hz\. The dataset includes human annotations for various arrhythmia types\. For our purposes, both normal and arrhythmic QRS complexes are considered as target R\-peaks\.
- Incart Arrhythmia Database \(Tihonenko et al\., [2007](https://arxiv.org/html/2605.16452#bib.bib52)\): This ECG dataset was collected from 32 patients \(17 men, 15 women; aged 18\-80, mean age: 58 years\) undergoing tests for coronary artery disease\. Signals are sampled at 257 Hz\. Similar to MIT\-BIH, the dataset provides annotations for arrhythmia types, and we include all detected QRS complexes \(normal and arrhythmic\) as target R\-peaks\.
- BIDMC Database \(Pimentel et al\., [2016](https://arxiv.org/html/2605.16452#bib.bib53)\): This dataset features synchronized ECG and PPG signals collected from 53 Intensive Care Unit \(ICU\) patients \(32 females, 21 males; median age: 64\.8 years\)\. Both signal modalities are sampled at 125 Hz\. Notably, the BIDMC dataset lacks pre\-existing systolic peak annotations for PPG signals\. Therefore, we employed a semi\-automatic annotation strategy: initial systolic peak detection was performed using Elgendi’s algorithm \(Elgendi et al\., [2013](https://arxiv.org/html/2605.16452#bib.bib12)\), followed by manual correction by annotators based on signal morphology and temporal alignment with the corresponding ECG R\-peaks\.
- Capnobase Database \(Karlen et al\., [2013](https://arxiv.org/html/2605.16452#bib.bib54)\): This dataset contains PPG signals recorded from 29 pediatric patients \(median age: 8\.7 years\) and 13 adult patients \(median age: 52\.4 years\) during medication examinations\. PPG signals were acquired in transmission mode via fingertip pulse oximeters at a sampling frequency of 300 Hz\. The dataset consists of 42 recordings, each approximately 8 minutes in duration\. Crucially, this dataset provides high\-quality systolic peak annotations that have been validated by medical experts\.
- Kansas Database \(Carlson et al\., [2020](https://arxiv.org/html/2605.16452#bib.bib55)\): Developed by Kansas State University, this open\-source dataset offers synchronized multimodal physiological signals, including BCG, ECG, PPG, and Arterial Blood Pressure \(ABP\) waveforms\. BCG signals were captured using four electromechanical film \(EMFi\) sensors placed under the mattress and four load cells under the bed frame\. Data were collected from 40 subjects \(17 males, ages 18\-65 years\), with four subjects presenting cardiovascular\-related conditions\. The raw BCG signal is sampled at 1000 Hz; for our analysis, it was downsampled to 100 Hz to reduce computational cost while retaining sufficient peak information within each segment\. As with BIDMC, this dataset lacked pre\-provided BCG J\-peak labels, so we adopted a similar semi\-automatic annotation procedure: algorithmic detection using Elgendi’s algorithm followed by annotator review\.
- BCG Arrhythmia Database \(Zhan et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib56)\): This dataset includes BCG recordings from 85 participants, encompassing individuals with sinus rhythm, heart failure \(HF\), and various cardiac arrhythmias such as Atrial Fibrillation \(AF\), Premature Ventricular Contractions \(PVCs\), and Premature Atrial Contractions \(PACs\)\. Signals are sampled at 100 Hz\. BCG J\-peaks in this database were labeled semi\-automatically, combining Elgendi’s algorithmic detection with subsequent manual verification by annotators\.
- BSG ICU Database: This clinical dataset was acquired from Hospital under an IRB\-approved protocol with informed consent\. Ethical approval and data collection are described in Appendix [D](https://arxiv.org/html/2605.16452#A4)\. It comprises 1,120 hours of continuous BSG and ECG recordings from 44 ICU patients\. The cohort encompasses a wide demographic range, with patient ages spanning from 6 to 86 years \(pediatric to geriatric\)\. Both signals were sampled at a frequency of 100 Hz\. Following the quality control protocols established in \(Song et al\., [2023](https://arxiv.org/html/2605.16452#bib.bib69)\), we retained a curated subset of 6,534 segments from 15 subjects for subsequent analysis\. The BCG J\-peaks in this database were annotated using a semi\-automatic protocol\. This process involved initial algorithmic detection using Nabian’s algorithm \(Nabian et al\., [2018](https://arxiv.org/html/2605.16452#bib.bib59)\) to generate candidate peaks, followed by manual verification and correction by annotators\.

Human Annotation and Data Statistics\. We use official peak annotations when available; otherwise we create human\-verified peak labels\. The ground truth for the BIDMC, Kansas, BCG Arrhythmia, and BSG ICU datasets was established using a semi\-automated protocol, comprising initial algorithmic detection followed by manual verification\. We invited three annotators with prior experience in peak annotation and trained them using a standardized guideline that defines modality\-specific fiducial characteristics \(ECG R, PPG systolic, BCG/BSG J\)\. To ensure label integrity, we enforced a strict consensus mechanism: only peak positions independently identified by all three annotators were retained\. Any sample with more than two disputed peaks was excluded from the dataset\. The resulting dataset statistics are presented in Table [3](https://arxiv.org/html/2605.16452#A1.T3)\. To demonstrate the physiological diversity of the corpus, we report the total data volume \(number of segments and peaks\) alongside key hemodynamic metrics: Inter\-Beat Interval \(IBI\), Heart Rate \(HR\), and Heart Rate Variability \(HRV\)\. This protocol is designed to maximize the fidelity of the ground\-truth labels\. Although the algorithm used for preliminary annotation may coincide with one of the baselines, the ground truth reflects manual correction, whereas each baseline is evaluated fully automatically, so the comparison across models remains fair\.
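The consensus rule described above can be sketched as follows. This is a minimal illustration rather than the annotation tooling used in the study: the matching tolerance `tol` (in samples), the function names, and the way disputed peaks are merged and counted are all assumptions made for the example.

```python
def consensus_peaks(annotations, tol=2):
    """Split peaks into unanimously agreed and disputed ones.

    annotations: list of sorted peak-index lists, one per annotator.
    tol: hypothetical matching tolerance in samples (an assumption).
    """
    reference, *others = annotations
    # A reference peak is kept only if every other annotator marked a
    # peak within `tol` samples of it.
    agreed = [
        p for p in reference
        if all(any(abs(p - q) <= tol for q in other) for other in others)
    ]
    # Every mark that is not close to an agreed peak is disputed.
    unmatched = sorted(
        p for ann in annotations for p in ann
        if not any(abs(p - a) <= tol for a in agreed)
    )
    # Merge nearby unmatched marks so one disputed beat is counted once.
    disputed = []
    for p in unmatched:
        if not disputed or p - disputed[-1] > tol:
            disputed.append(p)
    return agreed, disputed


def segment_is_kept(annotations, tol=2, max_disputed=2):
    """Drop a segment when it has more than `max_disputed` disputed peaks."""
    _, disputed = consensus_peaks(annotations, tol)
    return len(disputed) <= max_disputed
```

With three annotators who agree on three beats and one extra mark from a single annotator, the segment is kept with three consensus peaks and one disputed peak.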

Inter\-Rater Agreement Analysis: To evaluate the reliability of our ground truth labels, we conducted an inter\-rater agreement analysis, as detailed in Table [4](https://arxiv.org/html/2605.16452#A1.T4)\. We define Initial Agreement as the percentage of peak candidates unanimously identified by all three independent annotators prior to any reconciliation or consensus\-building phases\. To further quantify the statistical reliability of these annotations, we utilized Cohen’s Kappa \($\kappa$\) \(Sun, [2011](https://arxiv.org/html/2605.16452#bib.bib92)\) to assess the level of consensus while accounting for the probability of agreement occurring by chance\.
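For reference, Cohen's kappa compares the observed agreement $p_o$ with the agreement expected by chance $p_e$, as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two raters who each label the same list of peak candidates. The text does not state how the statistic was extended to three annotators, so averaging over rater pairs is only an assumption here, as are the function names.

```python
from itertools import combinations


def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters labelling the same candidates."""
    n = len(r1)
    # Observed agreement: fraction of candidates with identical labels.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(r1) | set(r2)
    p_e = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(ratings):
    """Average kappa over all rater pairs (an assumed convention)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Here each rating list would hold one label per peak candidate, for example 1 if the annotator accepted the candidate and 0 otherwise.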

As evidenced by the results, the PPG\-based BIDMC dataset exhibits the highest degree of agreement\. This is primarily attributable to the high signal\-to\-noise ratio and distinct morphological clarity of PPG signals, which allow for consistent peak localization\. Among the mechanical modalities, the Kansas BCG dataset demonstrates superior agreement compared to other BCG and BSG signals, likely due to the standardized and controlled laboratory conditions under which the data were acquired\.

Conversely, the BCG Arrhythmia and BSG ICU datasets show relatively lower agreement metrics\. This trend reflects the inherent challenges of annotating signals captured in clinical environments, where pathological arrhythmias, baseline wander, and varied patient movements introduce significant morphological ambiguity\.

Table 3\. Statistical summary of the utilized datasets, including subject counts\. Values for IBI, HR, and HRV are presented as Mean ± Standard Deviation\.

Table 4\. Inter\-Rater Agreement Metrics for Semi\-Annotated Datasets\.

## Appendix B BASELINE DETAILS

### B\.1\. Signal\-Processing Baselines

These methods leverage domain\-specific heuristics and mathematical transformations tailored to individual signal modalities\. To simplify replication, we adopt the public implementations from NeuroKit2 \(Makowski et al\., [2021](https://arxiv.org/html/2605.16452#bib.bib90)\)\.

- Pan\-Tompkins \(Fariha et al\., [2020](https://arxiv.org/html/2605.16452#bib.bib26)\): A widely adopted algorithm for ECG R\-peak detection, it utilizes a series of signal processing steps including filtering, differentiation, squaring, and moving\-window integration to extract slope and energy information, followed by adaptive thresholding to identify R\-peaks\.
- Nabian \(Nabian et al\., [2018](https://arxiv.org/html/2605.16452#bib.bib59)\): This ECG R\-peak detection method employs a sliding window technique\. It identifies the maximum amplitude within each window, designating it as a potential R\-peak, and subsequently refines these detections\.
- Elgendi \(Elgendi, [2012](https://arxiv.org/html/2605.16452#bib.bib60)\): Designed for PPG systolic peak detection, this algorithm defines regions of interest by calculating two moving averages with distinct window sizes\. Peaks are then identified as local maxima within these predefined regions\.
- Bishop \(Bishop and Ercole, [2018](https://arxiv.org/html/2605.16452#bib.bib61)\): A multi\-scale approach for PPG, it constructs a Local Maxima Scalogram \(LMS\) by analyzing the signal at various levels of smoothing\. Peaks are then robustly identified by detecting common local maxima across these different scales\.
- Pino \(Pino et al\., [2017](https://arxiv.org/html/2605.16452#bib.bib62)\): This method focuses on BCG J\-peak detection, employing techniques such as wavelet transformations, template matching, and signal envelopes to enhance and isolate the characteristic J\-peak morphology within noisy BCG signals\.
- Choi \(Choi et al\., [2009](https://arxiv.org/html/2605.16452#bib.bib63)\): This BCG J\-peak detection algorithm segments the signal based on an estimated mean heartbeat interval\. It then identifies local maxima within each segment and eliminates false positives through analysis of peak\-to\-peak intervals, enhancing robustness to noise\.
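To make the two\-moving\-average idea in the Elgendi entry concrete, the sketch below marks regions where a short moving average of the squared signal exceeds a longer one, then takes the local maximum inside each region. It is a simplified illustration, not the NeuroKit2 implementation used for the baselines. It assumes the input is already band\-pass filtered, and the window lengths and offset `beta` are illustrative defaults rather than values reported in this paper.

```python
import math


def moving_average(x, width):
    """Centered moving average that truncates the window at the edges."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def two_average_peaks(signal, fs, w_peak=0.111, w_beat=0.667, beta=0.02):
    """Detect peaks as maxima inside blocks where MA_peak > MA_beat + offset."""
    # Emphasize upward pulses: clip negatives, then square.
    squared = [max(v, 0.0) ** 2 for v in signal]
    n_peak = max(1, int(round(w_peak * fs)))
    n_beat = max(1, int(round(w_beat * fs)))
    ma_peak = moving_average(squared, n_peak)
    ma_beat = moving_average(squared, n_beat)
    offset = beta * sum(squared) / len(squared)
    peaks, start = [], None
    for i in range(len(signal) + 1):
        inside = i < len(signal) and ma_peak[i] > ma_beat[i] + offset
        if inside and start is None:
            start = i  # a region of interest begins
        elif not inside and start is not None:
            # Keep only regions at least as wide as the short window.
            if i - start >= n_peak:
                peaks.append(max(range(start, i), key=lambda k: signal[k]))
            start = None
    return peaks


# Synthetic example: five Gaussian pulses, one per second, sampled at 100 Hz.
fs = 100
centers = range(50, 500, 100)
demo = [sum(math.exp(-((i - c) ** 2) / (2 * 8.0 ** 2)) for c in centers)
        for i in range(500)]
demo_peaks = two_average_peaks(demo, fs)
```

The additive offset keeps flat, near\-zero stretches between beats from being treated as regions of interest.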

### B\.2\. Deep Learning Baselines

These models represent advanced data\-driven approaches, designed to learn complex features directly from the physiological signals\.

- CNN\-SWT \(Yun et al\., [2022](https://arxiv.org/html/2605.16452#bib.bib64)\): Originally proposed for robust ECG R\-peak detection, this model is a Convolutional Neural Network \(CNN\) architecture that leverages Stationary Wavelet Transform \(SWT\) coefficients as input, combined with separable convolutions to enhance feature extraction\.
- 1D\-UNet\+\+ \(Zhou et al\., [2021](https://arxiv.org/html/2605.16452#bib.bib65)\): An extension of the U\-Net architecture, this model utilizes nested dense skip connections between its encoder and decoder paths\. Although initially designed for medical image segmentation, its 1D adaptation has demonstrated strong performance in time\-series analysis, including BCG J\-peak detection\.
- FR\-Net \(Chen et al\., [2023c](https://arxiv.org/html/2605.16452#bib.bib66)\): This is a specialized CNN\-Transformer based network primarily developed for fetal R\-peak detection in challenging ECG signals\. Its architecture combines convolutional layers for local feature extraction with transformer blocks to capture long\-range dependencies, making it suitable for complex peak detection tasks\.

## Appendix C DETAILED EXPERIMENT PROTOCOL

Supervised Fine\-Tuning Configuration\. Peak\-Detector was fine\-tuned using Qwen2\.5\-3B\-Instruct as the base model with a maximum sequence length of 2048 tokens and a dropout rate of 0\.1\. The model was trained for 5 epochs using the AdamW optimizer with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$\. The learning rate was set to $2 \times 10^{-5}$ with a cosine decay schedule following 500 warmup steps\. To prevent overfitting, we applied weight decay of 0\.01 and gradient clipping with a maximum norm of 1\.0\. The training process utilized a per\-device batch size of 32 with 4 gradient accumulation steps, yielding an effective batch size of 128\. Mixed precision training with BF16 was employed to improve computational efficiency\. The model was trained on 4 $\times$ NVIDIA A6000 GPUs \(80GB each\)\.
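The learning\-rate schedule described above can be sketched as a function of the optimizer step. The peak rate and warmup length come from the text. The linear shape of the warmup, the decay floor `min_lr`, and the `total_steps` value used in the example are assumptions, since the configuration does not state them.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=500, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run length, used only to illustrate the shape of the curve.
schedule = [lr_at_step(s, total_steps=10_000) for s in (0, 250, 500, 10_000)]
```

In this sketch the rate is zero at step 0, reaches the peak at step 500, and returns to the floor at the final step.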

GRPO Configuration\. The model was further optimized using Group Relative Policy Optimization \(GRPO\)\. The actor model was initialized from a supervised fine\-tuned checkpoint with a learning rate of $1 \times 10^{-6}$\. Maximum prompt and response lengths were set to 4000 and 2000 tokens, respectively\. To control policy divergence, KL divergence loss was enabled with a coefficient of 0\.001 using the low\-variance formulation\. The rollout generation utilized vLLM with tensor model parallelism and generated 16 samples per prompt\. Training was conducted on 4 GPUs for a single epoch with a training batch size of 24, PPO mini\-batch size of 12, and micro\-batch size of 2 per GPU\.
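Two ingredients of this setup can be illustrated in a few lines: the group\-relative advantage that GRPO computes across the samples drawn for one prompt, and a low\-variance per\-token KL estimate. This is a sketch under stated assumptions, not the training code. The use of the population standard deviation, the `eps` constant, and the reading of "low\-variance formulation" as the estimator $e^{d} - d - 1$ with $d = \log \pi_{\mathrm{ref}} - \log \pi_{\theta}$ are assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # Responses better than the group average get positive advantages.
    return [(r - mean) / (std + eps) for r in rewards]


def low_variance_kl(logp_policy, logp_ref):
    """Non-negative per-token KL estimate: exp(d) - d - 1, d = ref - policy."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


# A group of 16 hypothetical rewards, matching the 16 samples per prompt.
group = [1.0] * 8 + [0.0] * 8
advantages = group_relative_advantages(group)
```

Because advantages are centered within the group, a prompt whose samples all receive the same reward contributes no learning signal in this formulation.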

Code Availability\. The source code for the Peak\-Detector framework and the generated Peak\-Explanation Dataset is available on [GitHub](https://github.com/jimmylihui/Peak-Detector)\.

## Appendix D ETHICAL APPROVAL AND DATA COLLECTION FOR BSG ICU DATASET

The data collection for this study was conducted in the hospital ICU under an Institutional Review Board \(IRB\)\-approved protocol\. The patient consent process adhered strictly to ethical guidelines: data from patients, or their legal representatives, who declined consent—either during their ICU stay or after recovery—were excluded from the analysis\.

The sensing devices were permanently installed beneath the hospital beds, seamlessly integrated as part of the bed infrastructure\. The data acquisition process was entirely passive \(i\.e\., involving no emission of active radiation\) and contactless \(i\.e\., without any physical contact with patients\)\. As such, the system posed no risk to human subjects or interference with existing medical equipment\. Importantly, the installation and operation of these devices did not disrupt standard ICU monitoring or treatment procedures, nor did they require any intervention from patients or clinical staff during routine care\.

To minimize any potential disruption, device setup was meticulously planned and executed\. Prior to installation, all devices were fully configured and validated for network connectivity and operational reliability\. Installation was performed under the supervision of hospital personnel, ensuring a rapid, safe, and non\-intrusive process\. Once deployed, the devices operated autonomously, requiring no further human intervention\. To further reduce maintenance demands within the ICU, devices were powered by a direct connection to the hospital mains, obviating the need for battery changes\. In addition, the system software was engineered with robust fault\-tolerance features, including automatic recovery from network or system errors, to maximize operational stability\.

All data collection activities were performed solely by hospital staff\. Data science researchers had access exclusively to de\-identified data, as provided by the hospital\. This strict separation ensured the protection of patient privacy and eliminated the risk of any information leakage\.

## Appendix E CONTROLLED ADDITIVE\-NOISE STRESS TEST

To partially probe robustness to noise that may arise in less controlled settings, we conducted a controlled additive\-noise stress test by adding zero\-mean white Gaussian noise to the input signal with standard deviations of 0\.1, 0\.2, 0\.3, 0\.4, and 0\.5\. The quantitative results are reported in Table [5](https://arxiv.org/html/2605.16452#A5.T5), and representative visualizations are shown in Fig\. [16](https://arxiv.org/html/2605.16452#A5.F16)\. We include this experiment as a controlled stress test of noise tolerance, rather than as a substitute for fully free\-living evaluation\.
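
The perturbation itself is simple to reproduce\. The sketch below adds zero\-mean white Gaussian noise at the five reported standard deviations\. It assumes the signal is a plain list of amplitudes, and the helper name `add_white_gaussian_noise` is illustrative\. How the signal is scaled before noise is added is not stated in this section, so the amplitude units here are an assumption\.

```python
import random

NOISE_LEVELS = (0.1, 0.2, 0.3, 0.4, 0.5)  # standard deviations from the text


def add_white_gaussian_noise(signal, sigma, seed=0):
    """Return a copy of `signal` with i.i.d. N(0, sigma^2) noise added."""
    rng = random.Random(seed)  # fixed seed keeps the stress test repeatable
    return [x + rng.gauss(0.0, sigma) for x in signal]


def stress_test_inputs(signal):
    """Map each noise level to its perturbed copy of the input signal."""
    return {sigma: add_white_gaussian_noise(signal, sigma) for sigma in NOISE_LEVELS}
```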

As the perturbation level increases, performance degrades gradually and monotonically across all evaluation metrics\. For heart\-rate estimation, the MAE increases from $0.86 \pm 0.08$ BPM on clean signals to $2.99 \pm 0.24$ BPM at the highest noise level, while HR MAPE rises from $1.24 \pm 0.15\%$ to $4.34 \pm 0.34\%$\. A similar trend is observed for HRV estimation: HRV MAE increases from $9.97 \pm 0.17$ ms to $25.95 \pm 1.55$ ms, and HRV MAPE rises from $61.10 \pm 0.91\%$ to $181.12 \pm 29.66\%$ as the perturbation becomes more severe\.
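
For reference, the error metrics quoted above follow the usual definitions, sketched below\. The heart\-rate estimate \(60 divided by the mean inter\-beat interval\) is an assumption about how HR is derived from detected peaks\. The HRV statistic is not defined in this section, so it is left out\.

```python
def heart_rate_bpm(peak_times_s):
    """Mean heart rate from peak timestamps in seconds (60 / mean IBI)."""
    ibis = [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]
    return 60.0 * len(ibis) / sum(ibis)


def mae(pred, ref):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def mape(pred, ref):
    """Mean absolute percentage error, in percent of the reference value."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in zip(pred, ref)) / len(ref)
```

Because MAPE divides by the reference value, it can exceed 100% when the reference is small relative to the absolute error, as happens with the HRV figures above\.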

Peak\-detection performance also declines with increasing noise, but remains comparatively stable overall\. Specifically, the F1 score decreases from $0.9729 \pm 0.0021$ under clean conditions to $0.9343 \pm 0.0011$ at the highest noise level\. Precision drops from $0.9710 \pm 0.0025$ to $0.9239 \pm 0.0012$, and recall decreases from $0.9749 \pm 0.0023$ to $0.9448 \pm 0.0012$\. These results suggest that the detector tolerates moderate additive perturbations reasonably well for beat detection and heart\-rate estimation, whereas HRV\-related metrics are more sensitive as the corruption level increases\.

Overall, this analysis provides supporting evidence that the proposed model remains functional under controlled additive\-noise perturbations, particularly for peak detection and HR estimation\. At the same time, we do not interpret this experiment as a replacement for fully in\-the\-wild validation, where motion artifacts, posture transitions, contact variability, sensor displacement, and longer\-term domain shift may introduce more complex failure modes than additive Gaussian noise alone\.

Table 5\. Performance of Peak\-Detector under controlled additive Gaussian noise perturbations\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/mitbih_sample_100_noise_level_1.png)\(a\) Noise level $\sigma=0.1$
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/mitbih_sample_100_noise_level_2.png)\(b\) Noise level $\sigma=0.2$
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/mitbih_sample_100_noise_level_3.png)\(c\) Noise level $\sigma=0.3$
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/mitbih_sample_100_noise_level_4.png)\(d\) Noise level $\sigma=0.4$
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/mitbih_sample_100_noise_level_5.png)\(e\) Noise level $\sigma=0.5$

Figure 16\. Visualization of signal quality under different additive white Gaussian noise levels\. Increasing noise progressively distorts the waveform and makes peak detection more challenging\.

## Appendix F FAILURE CASE ANALYSIS

This section examines the representative failure modes of Peak\-Detector on the MIT\-BIH Arrhythmia Database, as illustrated in Fig\. [17](https://arxiv.org/html/2605.16452#A6.F17)\.

In the first segment \(Fig\. [17\(a\)](https://arxiv.org/html/2605.16452#A6.F17.sf1)\), the model misidentifies the S\-wave as an R\-peak\. This error is primarily attributed to the prominent negative amplitude of the S\-wave, which exhibits a morphology similar to that of an inverted R\-peak, leading to a false positive\. In the subsequent segments \(Fig\. [17\(b\)](https://arxiv.org/html/2605.16452#A6.F17.sf2)\), the model fails to localize the peaks accurately due to severe signal distortion\. In these instances, the underlying QRS complex is obscured by high\-frequency noise and baseline wander, resulting in a significantly reduced signal\-to\-noise ratio \(SNR\) that makes reliable peak selection more difficult\.

Specialized ECG models such as the UNet\+\+ can sometimes achieve smaller coordinate error on cleaner ECG signals because they are optimized for precise local QRS localization\. However, proximity alone does not guarantee physiologically correct fiducial localization: a prediction may fall between adjacent landmarks \(e\.g\., between the R and S peaks\) while still appearing close to the annotation\. In contrast, traditional peak detectors usually align more closely with anatomically meaningful morphological features\. This observation suggests that future improvements should place greater emphasis on anatomically meaningful fiducial alignment, rather than optimizing only global localization error\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/failure.png)\(a\) S\-peak misidentification
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/appendix/failure2.png)\(b\) Motion artifact distortion

Figure 17\. Analysis of failure cases in the MIT\-BIH dataset: \(a\) morphological confusion between S\-peaks and R\-peaks, and \(b\) detection failures caused by extreme signal corruption\.

## Appendix G ABLATION STUDY

To dissect the individual contributions of each component within the Peak\-Detector framework, we conducted a systematic ablation study on the BCG Arrhythmia dataset, with the results detailed in Table [6](https://arxiv.org/html/2605.16452#A7.T6)\. Our analysis begins with the baseline performance of a general\-purpose, instruction\-tuned LLM \(Qwen2\.5\-3B\-Instruct\), which yields a low F1\-score of 0\.1690\. This confirms that standard LLMs are ill\-suited for this precise numerical inference task without specialized adaptation\.

Table 6\. Ablation study results on the BCG Arrhythmia dataset, comparing precision, recall, F1\-score, HR MAE \(BPM\), and HRV MAE under a 50 ms matching tolerance\. Best performance per block is highlighted in bold\. “\-” indicates no valid result could be obtained\.

The introduction of our specialized Peak Representation is critical; removing this component \(‘w/o Peak Representation’\) and relying on raw sequences limits the F1\-score to 0\.8460 and results in a significantly higher HR MAE of 7\.87 BPM, underscoring the difficulty LLMs face in processing raw numerical data efficiently\. Supervised Fine\-Tuning \(SFT\) on our dataset provides the most substantial performance leap, increasing the F1\-score from the baseline 0\.1690 to 0\.8822\. However, SFT alone is insufficient to reach expert\-level performance, as evidenced by the performance gap between the ‘SFT\-only’ variant and the ‘Full Framework’ \(0\.8822 vs\. 0\.9467 F1\-score\)\.

The Full Framework, which integrates explanation generation and optimization, achieves the best physiological consistency with an HR MAE of 1\.43 BPM\. Notably, the ablation of the explanation task \(‘w/o Explanation’\) yields a marginally higher F1\-score of 0\.9476 compared to the Full Framework’s 0\.9467\. However, this comes at the cost of physiological accuracy, as the HR MAE degrades from 1\.43 to 1\.87 BPM\. This suggests that while the model can learn to detect peaks without generating explanations, compelling it to articulate reasoning \(the explanation task\) acts as a regularizer that enforces better alignment with the underlying cardiac rhythm, reducing heart rate estimation errors\.

Finally, the specialized nature of our approach is highlighted when compared against powerful, general\-purpose LLMs\. Claude\-Sonnet\-4\.5, Gemini\-2\.5\-Pro, and GPT\-5 achieve F1\-scores of 0\.8169, 0\.7723, and 0\.6490, respectively, all significantly lower than the 0\.9467 of our framework\. Furthermore, their HR MAEs range from 23\.26 to 66\.32 BPM, roughly 16 to 46 times higher than our model’s 1\.43 BPM\. This indicates that our domain\-specific approach, which combines a novel signal representation with a targeted tuning strategy, is essential for achieving high\-precision, physiologically valid performance\.

## Appendix H IMPACT OF RELATIVE TEMPORAL TOLERANCE

To mitigate the bias inherent in fixed temporal tolerances across varying heart rates, we additionally report a rate\-normalized evaluation using an adaptive tolerance of $\pm 5\%$ of the local inter\-beat interval \(IBI\), shown in Table [7](https://arxiv.org/html/2605.16452#A8.T7)\. Unlike fixed\-window metrics, this adaptive criterion tightens detection constraints during periods of tachycardia, thereby preventing the erroneous alignment of artifacts with true cardiac events\. All methods \(our model and all baselines\) are scored using the *same* peak\-matching protocol under each tolerance setting: one\-to\-one matching between predicted and ground\-truth peaks, with a true positive \(TP\) counted only when a predicted peak matches a ground\-truth peak within the specified tolerance window\. Therefore, the comparison is fair under both the fixed \($\pm 30$ ms\) and adaptive \($\pm 5\%$ IBI\) metrics\. Under this adaptive criterion, Peak\-Detector exhibited remarkable stability, showing minimal degradation compared to fixed\-window results \(e\.g\., maintaining an F1\-score of 0\.9698 on MIT\-BIH\)\. This resilience was characteristic of learning\-based models \(both Peak\-Detector and the deep\-learning baselines\), which generally maintained consistent performance across both tolerance paradigms, indicating the successful encoding of scale\-invariant temporal features\. In contrast, heuristic baselines such as Pan\-Tompkins degrade substantially \(e\.g\., F1 dropping from 0\.4361 \(fixed\) to 0\.2861 \(relative\) on MIT\-BIH\), underscoring the brittleness of static morphological thresholds when subjected to rate\-modulated signal variability\.
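
The scoring protocol can be sketched as below\. This is a minimal illustration under stated assumptions, not the authors' evaluation script: matching is greedy \(each ground\-truth peak takes its nearest unused prediction\), and the "local IBI" of a beat is taken as the gap to the previous ground\-truth peak \(the next peak for the first beat\)\. All function names are hypothetical\.

```python
def fixed_tolerance(seconds):
    """Same tolerance for every beat, e.g. 0.030 for +/-30 ms."""
    return lambda truth, i: seconds


def relative_tolerance(fraction=0.05):
    """Tolerance as a fraction of the local inter-beat interval (assumed definition)."""
    def tol(truth, i):
        if len(truth) < 2:
            raise ValueError("need at least two ground-truth peaks")
        ibi = truth[i] - truth[i - 1] if i > 0 else truth[1] - truth[0]
        return fraction * ibi
    return tol


def match_peaks(pred, truth, tol_fn):
    """Greedy one-to-one matching of peak times in seconds; returns (TP, FP, FN)."""
    used = set()
    tp = 0
    for i, t in enumerate(truth):
        tol = tol_fn(truth, i)
        best, best_d = None, None
        for j, p in enumerate(pred):
            d = abs(p - t)
            if j not in used and d <= tol and (best_d is None or d < best_d):
                best, best_d = j, d
        if best is not None:
            used.add(best)  # each prediction may match at most one beat
            tp += 1
    return tp, len(pred) - tp, len(truth) - tp


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)
```

A 5% window is narrower than 30 ms whenever the IBI is shorter than 600 ms, that is, at heart rates above 100 BPM\. A prediction that is 28 ms off a beat following a 500 ms interval therefore counts as a hit under the fixed criterion but as a miss under the relative one\.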

Table 7\. Peak Detection Performance Comparison across Signal Modalities and Baselines using a Relative Temporal Tolerance \($\pm 5\%$ of the local peak\-to\-peak interval\)\. Lower values are better for MAE \(HR \(bpm\), HRV \(ms\)\) and MAPE; higher values are better for F1, Pre, and Rec\. Best performance in each metric is bold, second best is underlined\. Standard deviations are shown with the same reduced formatting as Table [1](https://arxiv.org/html/2605.16452#S4.T1)\. Superscripts on Peak\-Detector indicate significance against the second\-best method: \*$p<0.05$, \*\*$p<0.01$\. The $p$\-value column is retained for reference\.

## Appendix I IMPACT OF MODEL SCALE

To disentangle the performance benefits of our proposed architecture from mere parameter scaling, this section investigates the relationship between model capacity and Heart Rate \(HR\) estimation accuracy across the evaluated deep learning baselines\. We systematically vary the trainable parameters of each model by adjusting channel dimensions and network depth to analyze the trade\-off between computational cost and performance \(measured by Mean Absolute Error, MAE\)\.

##### Experimental Setup

We implemented capacity scaling for three distinct architectures:

- FR\-Net: We scaled the model by modifying the base channel size $C\in\{32,48,64,96,128,160\}$, resulting in parameter counts of $S\in\{2.5\text{M},4.3\text{M},6.7\text{M},12.6\text{M},20.4\text{M},29.9\text{M}\}$\.
- Unet\+\+: We explored variations in both base filter count \($F$\) and network depth \($D$\), evaluating combinations of $(F,D)\in\{(8,3),(16,3),(16,4),(32,3),(32,4),(16,5)\}$, resulting in parameter counts of $S\in\{0.1\text{M},0.4\text{M},1.6\text{M},1.7\text{M},7.1\text{M},7.3\text{M}\}$\.
- CNN\-SWT: We evaluated six distinct channel progression configurations, ranging from a lightweight model \($\{8\to 16\to 32\to 64\to 128\}$\) to a high\-capacity variant \($\{48\to 96\to 192\to 384\to 768\}$\), resulting in parameter counts of $S\in\{0.2\text{M},0.7\text{M},1.5\text{M},2.7\text{M},4.2\text{M},6.1\text{M}\}$\.

##### Performance Analysis

The relationship between parameter size and HR MAE for these baselines is illustrated in Fig\. [18](https://arxiv.org/html/2605.16452#A9.F18)\. Increasing model capacity can reduce MAE initially, but as scaling continues, we observe a performance plateau where improvements saturate and, in some instances, degrade\. This trend indicates that indiscriminate parameter expansion in standard deep learning models yields diminishing returns and may introduce overfitting or optimization difficulties\.

In contrast, Fig\. [15](https://arxiv.org/html/2605.16452#S5.F15) reports Peak\-Detector performance when fine\-tuning the same formulation on Qwen backbones ranging from 0\.5B to 7B parameters under an identical protocol\. On BCG Arrhythmia, larger backbones consistently improve HR/HRV estimation and detection quality in our tested range\.

##### Conclusion

These findings suggest that the performance gap is not driven solely by model size; standard deep learning models appear unable to effectively utilize additional capacity beyond a certain threshold\. Conversely, the Peak\-Detector demonstrates a capability to leverage larger parameter spaces for continuous improvement\. Furthermore, beyond quantitative superiority at scale, the Peak\-Detector offers inherent explainability, a distinct advantage over black\-box deep learning models\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/revised_appendix/hr_mae_vs_params_yun.png)\(a\) CNN\-SWT
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/revised_appendix/hr_mae_vs_params.png)\(b\) 1D\-Unet\+\+
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/revised_appendix/hr_mae_vs_params_CNN.png)\(c\) FR\-Net

Figure 18\. Impact of model scale on performance\. We analyze the trade\-off between the number of trainable parameters and Heart Rate Mean Absolute Error \(HR MAE\) for three architectures: \(a\) CNN\-SWT, \(b\) Unet\+\+, and \(c\) FR\-Net\.

## Appendix J EVALUATION OF MODEL COMPRESSION TECHNIQUES

To characterize the resource trade\-offs of compressing Large Language Models \(LLMs\) in our setting, we conducted a comparative analysis of several memory\-reduction strategies on the BCG Arrhythmia dataset\. The primary goal of this experiment is to quantify the trade\-off between reduced memory footprint and downstream physiological accuracy, rather than to claim mature real\-time readiness for mobile or wearable deployment\. All experiments were conducted with a batch size of 1 to provide a consistent evaluation setting across compression methods\.

We evaluated three optimization strategies against a full\-precision \(FP16\) baseline: Weight–Activation Quantization \(W4A16\), Structural Pruning \(50% sparsity\), and Model Distillation using a 0\.5B\-parameter student model\. The baseline Qwen2\.5\-3B model serves as the highest\-fidelity configuration in this comparison, achieving a heart\-rate mean absolute error \(HR MAE\) of 1\.75 BPM with a memory footprint of 5\.792 GB\.

The experimental results, summarized in Table [8](https://arxiv.org/html/2605.16452#A10.T8), highlight distinct trade\-offs for each compression method:

- Weight–Activation Quantization \(W4A16\): This method reduces memory usage by approximately 57% \(to 2\.492 GB\) by quantizing weights to 4\-bit integers while maintaining 16\-bit activations\. This reduction is accompanied by a moderate increase in HR MAE to 2\.85 BPM\.
- Structural Pruning \(50% Sparsity\): Pruning further reduces memory usage to 2\.145 GB, but yields the largest degradation in accuracy among the tested methods, increasing HR MAE to 4\.44 BPM\. This result suggests that the model’s performance on BCG signals is sensitive to the removal of structural capacity\.
- Model Distillation \(0\.5B\): Distillation achieves the largest memory reduction, lowering the footprint to 0\.942 GB\. However, the 0\.5B student model produces an HR MAE of 3\.26 BPM, indicating a substantial performance gap relative to the 3B teacher configuration under this task setting\.
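
The trade\-offs listed above can be put on a common scale with a few lines of arithmetic\. The sketch below uses only the memory and HR MAE figures quoted in this appendix\. The dictionary layout and the `trade_off` helper are illustrative\.

```python
BASELINE = {"memory_gb": 5.792, "hr_mae_bpm": 1.75}  # FP16 Qwen2.5-3B

VARIANTS = {
    "W4A16 quantization": {"memory_gb": 2.492, "hr_mae_bpm": 2.85},
    "50% structural pruning": {"memory_gb": 2.145, "hr_mae_bpm": 4.44},
    "0.5B distillation": {"memory_gb": 0.942, "hr_mae_bpm": 3.26},
}


def trade_off(variant):
    """Return (memory saved in percent, added HR MAE in BPM) vs. the baseline."""
    saved = 100.0 * (1.0 - variant["memory_gb"] / BASELINE["memory_gb"])
    added = variant["hr_mae_bpm"] - BASELINE["hr_mae_bpm"]
    return round(saved, 1), round(added, 2)


for name, variant in VARIANTS.items():
    saved, added = trade_off(variant)
    print(f"{name}: {saved}% less memory, +{added} BPM HR MAE")
```

On these numbers, quantization saves about 57% of memory for an extra 1\.10 BPM of error, pruning about 63% for 2\.69 BPM, and distillation about 84% for 1\.51 BPM\. Pruning is therefore the least favorable of the three on both axes relative to distillation\.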

Table 8\. Experimental Comparison of LLM Optimization Methods for Mobile BCG Analysis \(Qwen 2\.5 3B\)\. HR MAE: Heart Rate Mean Absolute Error\.

Overall, these results show that aggressive compression can substantially reduce memory requirements, but at the cost of non\-negligible degradation in HR accuracy\. Under our test setting, this trade\-off suggests that the current 3B configuration remains the strongest option when higher\-fidelity peak analysis is required, while smaller compressed variants may be useful only in more resource\-constrained scenarios where reduced memory usage is prioritized over accuracy\. Accordingly, we interpret this analysis as evidence of a meaningful memory–accuracy trade\-off in our LLM\-based formulation, rather than as validation of strict real\-time, always\-on wearable deployment\. In line with the broader deployment framing of this paper, the present results support Peak\-Detector most naturally as a selectively invoked, higher\-fidelity analysis component, while more resource\-aware variants remain an important direction for future work\.

## Appendix K VERIFICATION FRAMEWORK FOR LLM HALLUCINATION

Hallucination \(Huang et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib73)\) is a known risk when LLMs generate free\-form rationales\. In our data\-construction pipeline, the teacher LLM is not responsible for peak labeling: it is explicitly provided with \(i\) the candidate extrema list from our Peak Representation and \(ii\) the ground\-truth peak list, and is only asked to explain why the ground\-truth peaks should be selected from the candidates\. Therefore, hallucination risk is confined to the explanation text, mainly in two forms: \(a\) *objective inconsistencies* \(e\.g\., timestamps/amplitudes/intervals that contradict the provided metadata\), and \(b\) *semantic inaccuracies* \(e\.g\., logically unsupported rationale\)\. To ensure explanations are faithful and usable for instruction tuning, we apply a three\-stage verification pipeline before adding any teacher\-generated explanation into the Peak\-Explanation dataset:

1. Visualization\-Assisted Verification Interface: We parse the teacher LLM output and render it as an interactive overlay on the waveform \(Fig\. [19](https://arxiv.org/html/2605.16452#A11.F19)\)\. The interactive user interface \(UI\) explicitly highlights *LLM\-selected target peaks* and *LLM\-rejected candidate peaks* together with their timestamps, indices, amplitudes, and the corresponding justification, enabling fast sanity checks by human reviewers\.
2. Rule\-Based Factual Consistency Check: We automatically verify objective claims and formatting in the explanation: \(i\) the output peak list matches the provided ground truth; \(ii\) every referenced timestamp/index exists in the candidate list; \(iii\) stated amplitudes and inter\-beat intervals are consistent with the underlying numeric values \(within a small tolerance\); and \(iv\) the output follows the required template\. Explanations with any factual inconsistency are discarded\.
3. Qualitative Human Review: For semantic quality, human reviewers inspect the explanations using the UI and label them as Concise \(accurate and clear\), Ambiguous \(technically correct but a bit vague\), or Incorrect \(logical fallacies or misleading rationale\)\.
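
The rule\-based check in Step 2 could look roughly like the sketch below\. It covers checks \(i\) to \(iii\) for indices and amplitudes only\. The record layout, field names, and tolerance value are hypothetical, since the parsed output format is not specified here, and the template check \(iv\) is omitted\.

```python
def check_explanation(record, candidates, ground_truth, amp_tol=1e-3):
    """Return a list of failed checks; an empty list means the record passes.

    record:       {"peaks": [index, ...],
                   "claims": [{"index": int, "amplitude": float}, ...]}
    candidates:   {index: amplitude} taken from the peak representation
    ground_truth: list of ground-truth peak indices
    """
    failures = []
    # (i) the output peak list must equal the provided ground truth
    if sorted(record["peaks"]) != sorted(ground_truth):
        failures.append("peak list differs from ground truth")
    for claim in record["claims"]:
        idx = claim["index"]
        # (ii) every referenced index must exist in the candidate list
        if idx not in candidates:
            failures.append(f"index {idx} not in candidate list")
        # (iii) stated amplitudes must agree with the underlying values
        elif abs(claim["amplitude"] - candidates[idx]) > amp_tol:
            failures.append(f"amplitude mismatch at index {idx}")
    return failures
```

Under the discard rule stated above, a record would be kept only when the returned list is empty\.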

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/revised_appendix/interactive_visualization_for_J_peak.png)\(a\) Target Peaks
![Refer to caption](https://arxiv.org/html/2605.16452v1/source/revised_appendix/Interactive_Visualization_rejected_peaks.png)\(b\) Rejected Peaks

Figure 19\. Interactive verification interface\. \(a\) Visualization of valid Target Peaks with morphological details\. \(b\) Visualization of Rejected Peaks with associated exclusion criteria\.

Specifically, in Step 1, as illustrated in Fig\. [19\(a\)](https://arxiv.org/html/2605.16452#A11.F19.sf1), the interface for valid target peaks aggregates critical metadata, including the timestamp, signal index, peak amplitude, and a detailed justification of amplitude dominance, alongside the associated fiducial complex morphology \(e\.g\., I\-J\-K\)\. Conversely, for filtered candidates, the UI displays the timestamp, index, and amplitude, while explicitly highlighting the rejection rationale, primarily focusing on physiological interval violations, as shown in Fig\. [19\(b\)](https://arxiv.org/html/2605.16452#A11.F19.sf2)\. Note that these peaks were rejected based on the LLM’s explanation for candidate peaks, which differs from the rejection criteria used in the verification framework\.

Verification Results\. The quantitative outcomes of the verification framework are summarized in Table [9](https://arxiv.org/html/2605.16452#A11.T9)\. The Rule\-Based Factual Consistency Check \(Step 2\) was applied to the full dataset to guarantee complete adherence to objective signal constraints\. Conversely, for the Qualitative Human Review \(Step 3\), we employed random sampling \($N=250$ per dataset\) to derive a statistically representative estimate of semantic quality without the prohibitive cost of exhaustive manual annotation\. In Step 2, we observed a 0% rejection rate across all objective checks\. The explanations adhered to the ground\-truth signal attributes, indicating that the teacher LLM consistently grounded its numerical reasoning in the provided signal metadata without hallucinating objective parameters\.

Table 9\. Comprehensive Verification Results\. The table shows the attrition of hallucinations during the automated check \(Step 2\) and the semantic quality distribution of the remaining explanations during the human review \(Step 3\)\.

Step 3 evaluated the semantic utility of the generated content, as shown in Table [9](https://arxiv.org/html/2605.16452#A11.T9)\. The assessment revealed that the vast majority of explanations \(91\.7%\) were Concise, providing accurate and clear justifications for the detected peaks\. A minor fraction \(8\.3%\) were classified as Ambiguous, where the reasoning was technically correct but lacked specificity\. Notably, no Incorrect cases were observed with $N=250$, suggesting that the model maintained logical coherence and avoided fabricating rationale that contradicted the signal morphology\. These analyses suggest that semantic hallucination is rare in practice after our verification\. As a result, we retain all teacher\-generated explanations \(including ambiguous ones\) for training, given that exhaustive manual vetting is infeasible\.

## Appendix L CROSS\-DATASET GENERALIZATION ANALYSIS

To assess the generalization capabilities and robustness of Peak\-Detector to domain shift, we conducted a comprehensive cross\-dataset evaluation\. Models were trained on a single source dataset and evaluated on all others\. The F1\-score results for Peak\-Detector and the strongest baseline, FR\-Net, are presented in Table [10](https://arxiv.org/html/2605.16452#A12.T10) and Table [11](https://arxiv.org/html/2605.16452#A12.T11), respectively\.

In\-Domain Analysis\. When evaluated on an in\-domain but unseen dataset \(e\.g\., training on MIT\-BIH and testing on Incart\), Peak\-Detector’s F1\-score drops slightly from 0\.9805 to 0\.9580\. This minor degradation, observed across all modalities, suggests that while the model learns modality\-specific features effectively, it is also susceptible to dataset\-specific biases, even within the same signal type\.

Table 10\. Cross\-Dataset Generalization Performance \(F1\-score\) for Peak\-Detector\. Rows indicate the training \(source\) dataset and columns indicate the testing \(target\) dataset\.

Table 11\. Cross\-Dataset Generalization Performance \(F1\-score\) for FR\-Net\. Rows indicate the training \(source\) dataset and columns indicate the testing \(target\) dataset\.

Cross\-Domain Analysis\. Performance degrades more significantly in cross\-domain scenarios, highlighting that models learn features specific to the physiological origin of each modality \(electrical vs\. optical vs\. mechanical\)\. For instance, Peak\-Detector trained on ECG data \(MIT\-BIH\) achieves an F1\-score of 0\.6304 when tested on PPG data \(BIDMC\)\. Crucially, even this attenuated performance is substantially higher than that of an untrained base model \(F1\-score of 0\.1712, see Table [6](https://arxiv.org/html/2605.16452#A7.T6)\), indicating that the framework learns a foundational, transferable knowledge of what constitutes a “peak\.” In comparison, FR\-Net suffers a far more severe performance degradation in similar scenarios\. When trained on MIT\-BIH \(ECG\) and tested on Capnobase \(PPG\), FR\-Net’s F1\-score plummets to 0\.1005, a near\-total failure\. In the same scenario, Peak\-Detector maintains a functional F1\-score of 0\.7466\. This stark difference suggests that while FR\-Net excels at learning modality\-specific features, its representations are less transferable, whereas Peak\-Detector’s language\-based reasoning provides greater out\-of\-domain robustness\.

Performance of a Combined Model\. The most compelling evidence for our framework’s potential as a universal peak detector is the performance of the Combined Model, trained on a mixed dataset \(MIT\-BIH, BIDMC, and Kansas\)\. As shown in the final row of Table [10](https://arxiv.org/html/2605.16452#A12.T10), this single model achieves robust and high\-quality performance across all six test sets, closely approaching the results of the specialized, individually trained models \(e\.g\., 0\.9877 on BIDMC vs\. 0\.9928 for the specialized model\)\. In contrast, the combined FR\-Net model exhibits poor generalization\. While it performs well on the modalities included in its mixed training \(ECG and BCG\), its performance on the unseen PPG datasets is significantly worse than that of a specialized PPG\-trained FR\-Net \(e\.g\., an F1\-score of 0\.2467 on Capnobase, compared to 0\.9844 for the specialized model\)\. This outcome strongly suggests that the Peak\-Detector framework can be effectively scaled with diverse data to create a single, powerful, and truly generalizable model for cross\-modal analysis, a capability not demonstrated by the conventional deep learning architecture\.

## Appendix M EXPLORATORY DEVICE FEASIBILITY AND DEPLOYMENT BOUNDARIES

To characterize the deployment boundaries of Peak\-Detector, we analyze the relationship between available hardware resources and LLM parameter scale across representative device tiers\. As summarized in Table [12](https://arxiv.org/html/2605.16452#A13.T12), computational budgets vary substantially across the edge\-to\-server spectrum\. Prior studies suggest that resource\-constrained IoT platforms can support only relatively small models, whereas modern smartphones and other consumer devices may accommodate somewhat larger configurations under specialized hardware acceleration \(Nguyen and others, [2025](https://arxiv.org/html/2605.16452#bib.bib93); Li et al\., [2025](https://arxiv.org/html/2605.16452#bib.bib94); Team et al\., [2023](https://arxiv.org/html/2605.16452#bib.bib95)\)\. We include this comparison primarily to contextualize deployment trade\-offs, rather than to claim mature real\-time readiness for wearable or mobile execution\.

The baseline Peak\-Detector uses a 3B\-parameter model as the primary high\-fidelity configuration evaluated in this work\. We further examined compression through quantization, pruning, and distillation to study the memory–accuracy trade\-off under more constrained settings \(Appendix [J](https://arxiv.org/html/2605.16452#A10)\)\. These experiments show that substantial reductions in memory footprint are possible, but they are accompanied by non\-negligible degradation in downstream physiological accuracy\. Accordingly, we do not interpret these results as evidence of comparable edge performance, strict real\-time mobile deployment, or always\-on suitability for battery\-constrained wearables\.

Instead, we view the current evidence as supporting Peak\-Detector most strongly in a selective, higher\-fidelity analysis role within a tiered ubiquitous sensing pipeline\. In such a setting, lightweight front\-end methods can perform continuous low\-power screening, while Peak\-Detector is invoked only for flagged windows, uncertain segments, or retrospective summaries that require interpretable rationales and more reliable peak localization\. Developing more resource\-aware variants that better preserve both detection quality and explanation quality under edge constraints remains an important direction for future work\.

Table 12\. LLM Inference Capacity by Device Category \(Standard 4\-bit Quantization\)

## Appendix N IMPLICATIONS FOR ARRHYTHMIC PEAK RECOVERY

To examine the relevance of Peak\-Detector to arrhythmia\-sensitive settings, we evaluated its ability to recover peaks associated with irregular cardiac events, which constitute an important upstream prerequisite for downstream rhythm analysis\. We quantify this behavior using Arrhythmic Peak Recall, defined as the proportion of correctly identified arrhythmic beats within a 50 ms tolerance window\. The evaluation focuses on Atrial Premature Contractions \(APCs\) and Ventricular Premature Contractions \(VPCs\), both of which introduce irregular timing patterns that can challenge peak localization\.
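
A minimal sketch of this metric follows, assuming beats are given as \(time in seconds, annotation code\) pairs and that APCs and VPCs carry the MIT\-BIH codes `A` and `V`\. The greedy nearest\-neighbor matching and the function name are illustrative choices rather than the authors' implementation\.

```python
ARRHYTHMIC_LABELS = {"A", "V"}  # MIT-BIH codes: atrial / ventricular premature beats


def arrhythmic_peak_recall(beats, pred_times, tol_s=0.050):
    """Fraction of arrhythmic ground-truth beats with a prediction within tol_s.

    beats:      iterable of (time_s, label) ground-truth annotations
    pred_times: predicted peak times in seconds
    Every beat, normal or not, may consume at most one prediction.
    """
    unused = list(pred_times)
    hits = total = 0
    for time_s, label in sorted(beats):
        nearest = min(unused, key=lambda p: abs(p - time_s), default=None)
        matched = nearest is not None and abs(nearest - time_s) <= tol_s
        if matched:
            unused.remove(nearest)
        if label in ARRHYTHMIC_LABELS:
            total += 1
            hits += matched
    return hits / total if total else float("nan")
```

Only beats labeled as arrhythmic enter the denominator, so the score isolates how well premature beats are recovered regardless of performance on normal beats\.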

To assess both in\-distribution and out\-of\-distribution behavior, the model was trained exclusively on the MIT\-BIH Arrhythmia Database and evaluated on both the same dataset and the unseen MIT\-BIH Atrial Fibrillation Database\. As illustrated in Figures 20 and [21](https://arxiv.org/html/2605.16452#A14.F21.1), Peak\-Detector achieves high arrhythmic peak recall across subjects\. In the in\-distribution setting, recall exceeds 99% for most subjects, while on the unseen atrial fibrillation dataset it remains above 97% for most subjects\.

These results suggest that Peak\-Detector can maintain strong peak recovery performance even in the presence of irregular rhythms, and that the learned representation transfers reasonably well to unseen arrhythmia\-related conditions\. We interpret this analysis as evidence of downstream relevance rather than as a full validation of downstream arrhythmia analysis\. In particular, while accurate peak recovery is an important prerequisite for rhythm analysis, end\-to\-end arrhythmia classification involves additional considerations beyond the scope of the present work\.

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/implication/arrhythmia_recall.png)

Figure 20\. MIT\-BIH Arrhythmia Recall

![Refer to caption](https://arxiv.org/html/2605.16452v1/source/implication/AFib_recall.png)

Figure 21\. MIT\-BIH Atrial Fibrillation Recall

## Appendix O VISUALIZATION

Table 13\. Comparison of J\-Peak Detection Results and Explanations from Teacher LLM and Fine\-tuned Peak\-Detector
