Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults

arXiv cs.LG Papers

Summary

This paper presents a longitudinal multimodal study of 66 older adults using wearable sensing and clinical assessments to predict physical activity, sleep duration, and sleep apnea severity, finding that behavioral targets are more predictable and historical features are key predictors.

arXiv:2606.00345v1 Announce Type: new Abstract: Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real-world settings. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when targeting complex or clinically derived outcomes. In this work, we present a longitudinal multimodal study of 66 older adults conducted in real-world conditions and combining wearable sensing, behavioral monitoring, and clinical assessments. This setting provides a rare opportunity to study an underrepresented population in long-term, into-the-wild conditions. Building on this dataset, we investigate how the alignment between sensed signals and target variables affects predictive performance across health-related tasks. We design a unified evaluation framework spanning tasks with increasing levels of observability, including Activity Levels prediction, Sleep Duration estimation, and Sleep Apnea Severity classification. Our results reveal a clear gradient of predictability: highly observable behavioral targets achieve robust performance (macro-F1 65%), while more abstract outcomes remain challenging despite consistent improvements over baseline models. Moreover, through explainability analysis, we show that historical features consistently emerge as the most informative predictors, highlighting the central role of longitudinal information.

Cached at: 06/02/26, 03:41 PM

# Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults
Source: [https://arxiv.org/html/2606.00345](https://arxiv.org/html/2606.00345)
Mattia G\. Campana¹, Marcello Magno², Lorenza Pratali², Franca Delmastro¹ \(¹IIT\-CNR, Pisa, Italy; ²IFC\-CNR, Pisa, Italy\)

###### Abstract

Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real\-world settings\. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when targeting complex or clinically derived outcomes\. In this work, we present a longitudinal multimodal study of 66 older adults conducted in real\-world conditions and combining wearable sensing, behavioral monitoring, and clinical assessments\. This setting provides a rare opportunity to study an underrepresented population in long\-term, in\-the\-wild conditions\. Building on this dataset, we investigate how the alignment between sensed signals and target variables affects predictive performance across health\-related tasks\. We design a unified evaluation framework spanning tasks with increasing levels of observability, including *Activity Levels* prediction, *Sleep Duration* estimation, and *Sleep Apnea Severity* classification\. Our results reveal a clear gradient of predictability: highly observable behavioral targets achieve robust performance \(macro\-F1 $\approx 65\%$\), while more abstract outcomes remain challenging despite consistent improvements over baseline models\. Moreover, through explainability analysis, we show that historical features consistently emerge as the most informative predictors, highlighting the central role of longitudinal information\.

## 1 Introduction

The widespread adoption of wearable and mobile sensing technologies has opened new opportunities for monitoring human behavior and health in real\-world environments\[[19](https://arxiv.org/html/2606.00345#bib.bib2),[8](https://arxiv.org/html/2606.00345#bib.bib15)\]\. Continuous streams of physiological, behavioral, and contextual data enable the study of daily routines and long\-term health trajectories at an unprecedented scale\. In particular, longitudinal multimodal sensing has emerged as a key paradigm for understanding how lifestyle factors, such as physical activity and sleep, influence well\-being over time\[[4](https://arxiv.org/html/2606.00345#bib.bib16),[30](https://arxiv.org/html/2606.00345#bib.bib18)\]\.

However, despite significant progress, current research is still affected by several structural limitations\. First, most existing datasets are biased toward young or working\-age populations, while older adults, who represent a primary target for preventive healthcare and long\-term monitoring, remain largely underrepresented \[[12](https://arxiv.org/html/2606.00345#bib.bib28)\]\. This limitation is particularly critical, as models trained on younger cohorts often fail to generalize to older populations, where physiological responses, behavioral patterns, and health conditions differ substantially\. Second, although many datasets provide high\-resolution raw sensor signals, they often lack a consistent abstraction layer that enables scalable and interpretable longitudinal analysis \[[24](https://arxiv.org/html/2606.00345#bib.bib22)\]\. In contrast, wearable\-derived behavioral features, aggregated over meaningful time scales \(e\.g\., hourly, daily, or weekly\), offer a more suitable representation for modeling health and well\-being, as they better align with the temporal dynamics of the underlying phenomena while enabling robust long\-term analysis\. Finally, existing datasets rarely combine all desirable properties simultaneously, such as *multimodality*, *long\-term longitudinal coverage*, *in\-the\-wild collection*, and *integration with clinically relevant outcomes*, thus limiting both modeling capabilities and scientific insights \[[15](https://arxiv.org/html/2606.00345#bib.bib12), [17](https://arxiv.org/html/2606.00345#bib.bib14), [30](https://arxiv.org/html/2606.00345#bib.bib18)\]\.

In this work, we address these challenges through a comprehensive longitudinal study of older adults conducted entirely in real\-world conditions\. We collected a multimodal dataset from 66 participants \(mean age $72.6 \pm 4.4$ years\) over approximately 23 months, spanning multiple experimental phases and combining continuous sensing, structured physical intervention, and repeated clinical assessments\. Participants were equipped with an integrated sensing ecosystem including a *smartwatch*, an *under\-mattress sleep tracker*, a *smart scale*, and a custom *mobile application*, enabling the collection of heterogeneous data streams such as physical activity, cardiovascular signals, sleep dynamics, body composition, contextual information, and self\-reported well\-being measures\. The resulting dataset provides a detailed representation of daily life, collected entirely in the wild, capturing both short\-term variability and long\-term behavioral trends\.

Due to the sensitive nature of the data and GDPR regulations, access to the dataset will be granted for research purposes only, upon signing a Data Transfer Agreement \(DTA\) between the receiving institution and the authors’ institution\. Interested researchers can send a request directly to the authors, including a brief introduction of their research project, and will receive instructions to complete the necessary legal steps and access the data\. The dataset structure and characteristics are designed to support reproducible research and future extensions\. A detailed description of the dataset, its metadata, and the instructions for requesting data access are available at https://aathe\-ds\.github\.io/ \(currently anonymized for review purposes\)\.

A distinctive aspect of our study is the integration of passive sensing with a structured *Adapted Physical Activity* \(*APA*\) program for older adults and periodic clinical evaluations\. The APA program represents both an intervention and an incentive for the participants, providing guidelines for physical activity that they can maintain in their daily lives to improve their well\-being\. This design ensures high user engagement, provides psychological motivation to use the technology, and allows the observation of natural behavioral patterns in relation to clinically grounded outcomes, such as functional capacity and sleep\-related conditions\. Moreover, the longitudinal nature of the dataset supports the extraction of historical features, such as lagged values and rolling statistics, that capture temporal dependencies and behavioral routines, providing a richer representation than instantaneous measurements alone\.

Building on this dataset, we investigate the extent to which multimodal wearable\-derived signals can support the prediction of health\-related outcomes at different levels of abstraction\. Rather than focusing on a single task, we adopt a unified perspective based on the concept of *observability*, i\.e\., the degree to which a target variable is directly reflected in the available sensing modalities\. To this end, we design a set of machine learning tasks that span a spectrum of increasing difficulty: \(i\) binary next\-day physical activity prediction \(active vs\. non\-active\), representing a highly observable behavioral signal; \(ii\) binary sleep duration prediction, which reflects an intermediate and indirectly observable construct; and \(iii\) sleep apnea severity prediction, formulated as a three\-class classification problem \(mild, moderate, severe\) based on clinically defined thresholds\.
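To make task \(i\) concrete, the sketch below shows one way a binary next\-day activity target could be derived from a daily step series. It is only an illustration: the cut\-off `ACTIVE_STEPS_THRESHOLD` and the helper `next_day_targets` are hypothetical, and the study's actual definition of an active day may differ.

```python
# Hypothetical cut-off for an "active" day; NOT taken from the paper.
ACTIVE_STEPS_THRESHOLD = 6000


def next_day_targets(daily_steps, threshold=ACTIVE_STEPS_THRESHOLD):
    """Pair each day index d with the binary label of day d + 1.

    Returns a list of (d, label) tuples, where label is 1 if the next day
    reaches the threshold and 0 otherwise. Days whose next-day value is
    missing (None) are skipped rather than imputed, and the last day has
    no target because there is no following day.
    """
    pairs = []
    for d in range(len(daily_steps) - 1):
        tomorrow = daily_steps[d + 1]
        if tomorrow is None:
            continue
        pairs.append((d, int(tomorrow >= threshold)))
    return pairs


# Toy series with one missing day (e.g., the watch was not worn).
steps = [4200, 7100, None, 6500, 3000]
pairs = next_day_targets(steps)
print(pairs)
```

Keeping the label strictly one day ahead of the features is what makes the task a prediction rather than a description of the same day.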

Our results reveal a clear and consistent gradient of predictability across these tasks\. For the most observable setting, activity prediction achieves robust performance \(balanced accuracy and macro\-F1 around $65\%$\), despite the use of a strict Leave\-One\-Subject\-Out \(LOSO\) cross\-subject evaluation protocol, which requires models to generalize across heterogeneous individuals\. For sleep duration, performance remains above baselines but exhibits increased variability, reflecting the more complex relationship between wearable signals and sleep\-related outcomes\. The most challenging task is sleep apnea severity prediction, where models must discriminate between three clinically defined classes under weak signal\-to\-label alignment and inherent label ambiguity\. In this setting, models achieve balanced accuracy above $50\%$ and macro\-F1 around $50\%$, representing substantial improvements over both random \($+33\%$\) and majority\-class baselines\. Importantly, these results should be interpreted in light of the intrinsic difficulty of the task: the target variable is not directly observable from wearable signals, but instead emerges from complex physiological processes that are only partially captured by the sensing modalities\.
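The evaluation protocol and metrics mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the function names and the toy labels are assumptions. It shows how LOSO keeps every sample of one subject out of training, and why a constant (majority\-class) predictor scores a balanced accuracy of exactly 1/3 on a three\-class problem.

```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    by_subject = defaultdict(list)
    for i, subject in enumerate(subject_ids):
        by_subject[subject].append(i)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = [i for s, idx in by_subject.items() if s != held_out for i in idx]
        yield held_out, train_idx, test_idx


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy three-class labels and a constant "always mild" baseline.
y_true = ["mild", "mild", "moderate", "moderate", "severe", "severe"]
majority = ["mild"] * len(y_true)
baseline_bacc = balanced_accuracy(y_true, majority)
baseline_f1 = macro_f1(y_true, majority)

# Five samples from three hypothetical subjects give three LOSO folds.
folds = list(loso_splits(["s1", "s1", "s2", "s3", "s3"]))
```

Because balanced accuracy averages recall per class, a constant predictor gets recall 1 on one class and 0 on the others regardless of class imbalance, which is why it is a stricter yardstick than plain accuracy for the apnea task.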

Overall, our findings suggest that predictive performance in longitudinal multimodal sensing is fundamentally constrained by the degree of alignment between measured signals and target constructs, rather than by model complexity alone\. This perspective provides a principled framework for interpreting both successful and challenging prediction tasks, highlighting the importance of considering observability when designing and evaluating machine learning approaches for health monitoring\.

In summary, this work makes three main contributions:

1. It introduces a longitudinal multimodal dataset of older adults collected in\-the\-wild, combining wearable sensing, mobile data, and clinical assessments over an extended time span;
2. It provides a systematic characterization of data availability, quality, and temporal alignment in real\-world conditions, demonstrating the feasibility of long\-term monitoring in this population;
3. It proposes and validates an observability\-driven framework for analyzing predictive performance across heterogeneous health\-related tasks, offering insights into both the potential and the inherent limitations of wearable\-based inference\.

The rest of the manuscript is organized as follows\. Section [2](https://arxiv.org/html/2606.00345#S2) reviews related work, with a particular focus on longitudinal and multimodal sensing datasets, highlighting their key characteristics and limitations\. Section [3](https://arxiv.org/html/2606.00345#S3) describes the study design and presents the proposed dataset, detailing the population, sensing ecosystem, data collection protocol, and integration architecture\. Section [4](https://arxiv.org/html/2606.00345#S4) provides a comprehensive analysis of the dataset, including data availability, validity, and cross\-modal alignment, followed by both population\-level insights and machine learning experiments designed to assess predictive capabilities across tasks with different levels of observability \(Section [5\.1](https://arxiv.org/html/2606.00345#S5.SS1)\)\. Finally, Section [6](https://arxiv.org/html/2606.00345#S6) concludes the paper, discussing the implications of our findings, outlining potential use cases of the dataset, and highlighting directions for future research\.

## 2 Background and Related Works

Longitudinal multimodal sensing has emerged as a key paradigm for studying behavioral, physiological, and contextual dynamics in real\-world environments\. Recent datasets such as Tesserae \[[15](https://arxiv.org/html/2606.00345#bib.bib12)\], TILES\-2018 \[[17](https://arxiv.org/html/2606.00345#bib.bib14)\], StudentLife \[[26](https://arxiv.org/html/2606.00345#bib.bib17)\], and GLOBEM \[[30](https://arxiv.org/html/2606.00345#bib.bib18)\] have demonstrated the feasibility of continuously collecting heterogeneous data streams, including wearable physiological signals, smartphone\-derived behavioral features, and ecological momentary assessments \(EMAs\), over extended periods\. These efforts highlight the potential of multimodal sensing for capturing real\-world behavioral dynamics, while also exposing challenges related to temporal variability, data sparsity, and population heterogeneity\.

In parallel, recent advances in wearable sensing have shifted from unimodal activity recognition toward multimodal learning frameworks that integrate heterogeneous sensor modalities\. While early human activity recognition \(HAR\) systems were primarily based on inertial measurements, recent work increasingly leverages multimodal inputs, including physiological signals, contextual data, and cross\-device sensing to improve robustness and representation quality\[[21](https://arxiv.org/html/2606.00345#bib.bib21)\]\. For instance, recent studies explore multimodal wearable sensing and representation learning to better capture complex behavioral patterns and improve generalization across settings and populations\[[30](https://arxiv.org/html/2606.00345#bib.bib18),[22](https://arxiv.org/html/2606.00345#bib.bib19)\]\. These approaches highlight the importance of combining complementary sensing modalities to address noise, ambiguity, and variability in real\-world data\. However, despite these advances, most multimodal HAR approaches remain focused on short\-term activity recognition tasks and do not explicitly address longitudinal behavioral dynamics\. In particular, the datasets used in these works are often limited in duration and do not capture long\-term temporal dependencies, which are critical for modeling routines, lifestyle patterns, and health trajectories\.

Recent multimodal datasets such as ExtraSensory \[[24](https://arxiv.org/html/2606.00345#bib.bib22)\] and Lifetrace \[[3](https://arxiv.org/html/2606.00345#bib.bib20)\] represent a significant step toward real\-world deployment, combining wearable sensing, contextual information, and in\-the\-wild data collection\. In particular, Lifetrace provides longitudinal multimodal data collected over extended periods, integrating wearable and contextual signals to study daily behavior in natural settings\. However, these datasets still predominantly focus on young and healthy populations\. While they improve ecological validity and multimodal richness, they present key limitations: data collection is often constrained in duration compared to real\-world longitudinal monitoring, and clinically relevant ground truth remains limited\.

As a result, longitudinal multimodal datasets often provide limited representation of older adults, even though this population has the greatest need for continuous, long\-term monitoring and health evaluation\. These limitations have direct implications for predictive modeling on longitudinal multimodal data\. Existing works tend to focus on highly observable outcomes, such as physical activity levels or coarse behavioral states, where the mapping between sensor signals and labels is relatively direct\. In contrast, predicting more abstract or clinically derived constructs, such as sleep quality or health conditions, is inherently more challenging due to weaker alignment between observed signals and underlying phenomena, as well as increased label uncertainty\. This gap is further exacerbated by the limited availability of reliable ground truth for such outcomes\. Furthermore, the longitudinal nature of these datasets enables the use of historical features, such as lagged observations, rolling statistics, and exponentially weighted summaries, which capture temporal dependencies and behavioral routines\. While these approaches can improve predictive performance, their effectiveness depends strongly on the observability of the target and the temporal variability of the data\. In relatively stable or weakly informative settings, historical summaries may capture general trends but provide limited discriminative power for fine\-grained prediction\. Overall, existing datasets tend to satisfy only a subset of desirable properties—longitudinality, multimodality, in\-the\-wild deployment, wearable sensing, and population diversity—but rarely all simultaneously\. In particular, the combination of long\-term multimodal sensing, real\-world deployment, and a focus on older adult populations remains underexplored\.
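The three families of historical features named above can be illustrated on a single daily signal. The following is a minimal sketch in plain Python, assuming a complete series with no missing days; the helper names (`lagged`, `rolling_mean`, `ewm_mean`), the window length, and the smoothing factor are illustrative choices, not the feature set used in the study. All three features are causal: the value for a given day uses only earlier days.

```python
def lagged(values, lag):
    """Value observed `lag` days earlier (None when not yet available)."""
    return [values[i - lag] if i - lag >= 0 else None for i in range(len(values))]


def rolling_mean(values, window):
    """Mean of the `window` days strictly before each day (None until full)."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]
        out.append(sum(past) / window if len(past) == window else None)
    return out


def ewm_mean(values, alpha):
    """Recursive exponentially weighted mean of the days strictly before each day.

    A larger alpha gives more weight to the most recent days.
    """
    out, state = [], None
    for v in values:
        out.append(state)  # feature for today only sees the past
        state = v if state is None else alpha * v + (1 - alpha) * state
    return out


# Toy daily step counts for one participant.
steps = [4000, 6000, 8000, 5000, 7000]
lag_1 = lagged(steps, 1)          # yesterday's steps
roll_3 = rolling_mean(steps, 3)   # mean of the previous three days
ewm = ewm_mean(steps, alpha=0.5)  # smoothed history
```

In a real deployment, gaps from unworn devices would need explicit handling (for example, skipping or masking days) before these summaries are computed.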

In this work, we address these challenges by leveraging a longitudinal multimodal dataset collected in older adults under real\-world conditions, and by systematically analyzing predictive tasks across different levels of observability\. In particular, we investigate how the alignment between sensor signals and target variables, as well as the use of historical features, affects predictive performance across behavioral and clinically derived outcomes\.

## 3 Pilot Study and Dataset

![Refer to caption](https://arxiv.org/html/2606.00345v1/x1.png)

Figure 1: Experimental protocol\.

The study is the result of a multidisciplinary collaboration among computer scientists, cardiologists, sport scientists, and personal trainers\. The main idea originated from a clinical need to promote physical activity among older adults and to better understand its impact on health and well\-being over time\. Through close collaboration, we designed a monitoring framework capable of supporting both intervention and assessment by leveraging longitudinal and multimodal data collected in real\-world settings\. This need is aligned with international recommendations, such as the World Health Organization \(WHO\) guidelines on physical activity for older adults \[[28](https://arxiv.org/html/2606.00345#bib.bib39)\], which recommend 150–300 minutes of moderate physical activity per week\. This roughly corresponds to an additional 2,000–4,000 steps per day that, when combined with habitual daily activity, translates to approximately 6,000–8,000 total steps per day in older adults\. Although step count is a convenient and widely adopted indicator of physical activity, it captures only a limited aspect of individual health, motivating the need to integrate additional behavioral, physiological, and sleep\-related dimensions\. Within this context, our study aims to combine continuous monitoring of multiple domains with predictive modeling to capture how behavioral, physiological, and sleep\-related signals evolve over time and reflect changes in health status\.
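The conversion from weekly minutes to daily steps can be checked with a short back\-of\-the\-envelope calculation. The sketch below assumes a walking cadence of about 100 steps per minute for moderate\-intensity activity, a commonly used heuristic that is not stated in the text above; the constant and function names are illustrative.

```python
# Assumed cadence for moderate-intensity walking (heuristic, not from the paper).
MODERATE_CADENCE_STEPS_PER_MIN = 100


def extra_steps_per_day(minutes_per_week, cadence=MODERATE_CADENCE_STEPS_PER_MIN):
    """Convert weekly minutes of moderate activity into additional daily steps."""
    return minutes_per_week / 7 * cadence


low = extra_steps_per_day(150)   # lower bound of the WHO range
high = extra_steps_per_day(300)  # upper bound of the WHO range
print(round(low), round(high))

# Habitual baseline implied by the quoted totals (6,000-8,000 steps per day)
# once the quoted 2,000-4,000 additional steps are subtracted.
implied_habitual = (6000 - 2000, 8000 - 4000)
```

Under this assumption, the result lands slightly above 2,000 and 4,000 steps per day, in line with the rounded figures quoted above, and the totals imply a habitual baseline of roughly 4,000 steps per day.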

To support this objective, we implemented a structured Adapted Physical Activity \(APA\) program, delivered by certified personal trainers in an instrumented gym facility coordinated by the authors’ institution\. The program was offered over a period of six months to three consecutive groups, with sessions held twice per week, thus serving both as the intervention phase and as an incentive for sustained participation in the study\. Each user received a sensing ecosystem that researchers configured on the user’s personal smartphone and directly at home, as detailed in Section [3\.2](https://arxiv.org/html/2606.00345#S3.SS2)\. During the training sessions, the personal trainers provided each subject with a personalized exercise program tailored to individual characteristics\. The program also served as guidance aimed at promoting the continuation of physical activity outside the APA setting\.

As illustrated in Figure [1](https://arxiv.org/html/2606.00345#S3.F1), the study ran from the end of April 2024 to the end of March 2026, lasting approximately 23 months divided into three phases: Pilot, Phase 1, and Phase 2\. The Pilot phase was designed as a testing phase, both for the technological solutions and for user engagement\. A different group of users was involved in each phase\. They attended two APA training sessions per week \(1 hour each\), which included tailored exercise programs designed to improve functional capacity and promote independent living through safe, supervised physical activities\. APA, as defined by the International Federation of Adapted Physical Activity \[[10](https://arxiv.org/html/2606.00345#bib.bib40)\], focuses on mobility, balance, strength, and cardiovascular health, while adapting activities to individual capabilities and limitations\. We maintained a consistent session schedule across all trials, with interruptions due to summer closures of the gym facility or users’ holiday periods \(e\.g\., from July to September 2024 during the Pilot, and from July to August 2025 within Phase 2\), as well as transitions between consecutive trials\. Nevertheless, active users were allowed to continue using the sensing ecosystem at home and in their free time for remote monitoring, although adherence during these periods was generally reduced\.

The Pilot spanned from the end of April to the end of September 2024, with the APA intervention conducted for approximately two months during May and June\. It provided an initial feasibility and deployment validation stage, supporting the verification of device provisioning, cloud synchronization, service adherence, and backend integration procedures\. Phase 1 started in October 2024 and continued until the end of March 2025, representing the first long\-term monitoring period under stable deployment conditions, covering six months of combined remote sensing and intervention\. Finally, Phase 2 extended from April 2025 to March 2026, with an intermediate summer break in the APA program during July and August\. Across all trials, participants also attended three clinical test sessions for baseline \(T0\), mid\-term \(T1\), and follow\-up \(T2\) assessment; the T1 assessment was not performed in the Pilot trial due to the limited duration of this phase\. At each session, participants underwent a set of clinical evaluations to provide a complete anamnesis, including functional capacity levels, cognitive status, malnutrition, sleep disorders, and emotional states \(see Table [1](https://arxiv.org/html/2606.00345#S3.T1)\)\.

Table 1: List of clinical tests\.

| Test | Acronym | Target Domain | Output |
| --- | --- | --- | --- |
| Mini Mental State Examination \[[23](https://arxiv.org/html/2606.00345#bib.bib6)\] | MMSE | Cognitive status | 0\-30 |
| Mini Nutritional Assessment \[[25](https://arxiv.org/html/2606.00345#bib.bib11)\] | MNA | Nutritional status | 0\-30 |
| Pittsburgh Sleep Quality Index \[[16](https://arxiv.org/html/2606.00345#bib.bib7)\] | PSQI | Sleep quality | 0\-21 |
| Depression, Anxiety, and Stress Scales \[[9](https://arxiv.org/html/2606.00345#bib.bib8)\] | DASS\-21 | Mood | 0\-21 |
| International Physical Activity Questionnaire \[[11](https://arxiv.org/html/2606.00345#bib.bib10)\] | IPAQ | Self\-reported physical activity level | MET\-minutes/week |
| 6 Minutes Walking Test \[[5](https://arxiv.org/html/2606.00345#bib.bib5)\] | 6MWT | Functional capabilities | Distance covered \(meters\) |
| Timed Up and Go \[[2](https://arxiv.org/html/2606.00345#bib.bib9)\] | TUG | Mobility/Fall risk | Execution time \(s\) |

In addition, physiological evaluations were conducted in terms of body composition, echocardiography, blood pressure, and oxygenation\. Ambulatory blood pressure was measured with a portable device connected to an armband that automatically measures blood pressure at regular intervals \(three readings taken on the left arm, one minute apart\) \[[31](https://arxiv.org/html/2606.00345#bib.bib41)\]\.

The repeated clinical evaluations provided an anchor for long\-term remote monitoring, enabling an objective and clinically grounded interpretation of the subjects’ physio\-behavioral trajectories over time\. Finally, at the end of each trial, participants also completed the System Usability Scale \(SUS\) to evaluate the acceptability and usability of the proposed technological solutions\. In addition, all participants agreed to be recontacted 6 months and 1 year after the end of the study for a follow\-up evaluation\.

As a result, the overall study design combines structured physical intervention, continuous sensing, and clinical evaluations\. Its multi\-phase approach also supports both cross\-sectional comparisons between different subject cohorts and within\-subject longitudinal analyses spanning both in\- and out\-of\-protocol periods\. The study received Institutional Review Board \(IRB\) approval from the Ethics Committee of the authors’ institution \(full details will be provided upon acceptance to preserve anonymity\), and all participants provided written informed consent\.

### 3\.1 Population

Table 2: Population characteristics\.

| Category | Variable | Overall | Pilot | Phase 1 | Phase 2 |
| --- | --- | --- | --- | --- | --- |
| Demographics | Participants \(N\) | 66 | 20 | 29 | 30 |
| Demographics | Gender: Female | 41 \(62\.1%\) | 8 \(40\.0%\) | 17 \(58\.6%\) | 22 \(73\.3%\) |
| Demographics | Gender: Male | 25 \(37\.9%\) | 12 \(60\.0%\) | 12 \(41\.4%\) | 8 \(26\.7%\) |
| Demographics | Age \(years\) | 72\.6 ± 4\.4 | 73\.2 ± 3\.9 | 72\.5 ± 4\.5 | 72\.3 ± 4\.3 |
| Physiological | Body Mass Index \- BMI \(kg/m²\) | 25\.8 ± 4\.0 | 26\.2 ± 3\.8 | 26\.7 ± 4\.3 | 25\.2 ± 3\.8 |
| Physiological | Fat Mass \(%\) | 31\.3 ± 7\.6 | 31\.0 ± 8\.1 | 31\.3 ± 8\.7 | 31\.4 ± 6\.6 |
| Physiological | Fat Mass \(kg\) | 22\.3 ± 7\.3 | 23\.8 ± 8\.3 | 22\.8 ± 7\.9 | 21\.8 ± 6\.8 |
| Physiological | Fat\-Free Mass \- FFM \(kg\) | 47\.8 ± 8\.7 | 52\.8 ± 11\.7 | 48\.7 ± 9\.8 | 47\.0 ± 7\.7 |
| Physiological | Total Body Water \- TBW \(kg\) | 35\.0 ± 6\.4 | 38\.7 ± 8\.6 | 35\.7 ± 7\.2 | 34\.4 ± 5\.7 |
| Physiological | Heart Rate \- HR \(BPM\) | 70\.2 ± 11\.4 | 65\.7 ± 8\.6 | 67\.8 ± 10\.7 | 72\.5 ± 11\.8 |
| Physiological | Systolic Blood Pressure \- SBP \(mmHg\) | 148\.7 ± 21\.8 | 149\.3 ± 21\.4 | 147\.8 ± 23\.4 | 149\.5 ± 20\.5 |
| Physiological | Diastolic Blood Pressure \- DBP \(mmHg\) | 79\.9 ± 10\.9 | 80\.8 ± 6\.4 | 79\.0 ± 9\.7 | 80\.6 ± 11\.9 |
| Physiological | Peripheral Blood Oxygenation \- SpO2 \(%\) | 96\.7 ± 1\.3 | 96\.9 ± 1\.1 | 96\.4 ± 1\.7 | 97\.1 ± 0\.8 |

Table [2](https://arxiv.org/html/2606.00345#S3.T2) summarizes the demographic characteristics of the study population, as well as its baseline anthropometric and physiological characteristics\. Overall, we enrolled a total of $N=66$ older adults\. Specifically, 20 subjects participated in the Pilot, 29 in Phase 1, and 30 in Phase 2\. An overlap of 13 subjects occurred between Pilot and Phase 1, while Phase 2 consisted exclusively of new enrollments\. This overlap is related to the limited duration of the Pilot phase, from which we decided to retain the subjects who demonstrated high engagement and adherence to both the APA program and the remote monitoring\. From a demographic perspective, the overall sample was predominantly female, comprising 41 women \($62.1\%$\) and 25 men \($37.9\%$\)\. However, the gender distribution varied between phases: the Pilot and Phase 1 cohorts were relatively balanced \(approximately 60–40%\), while Phase 2 was characterized by a clear predominance of women \($73.3\%$\)\. This pattern is consistent with previous evidence indicating greater participation of women in structured physical activity interventions among older adults\.
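The cohort figures above are mutually consistent once the Pilot/Phase 1 overlap is counted only once. The short check below reproduces the arithmetic from the numbers quoted in the text and in Table 2; the variable names are illustrative.

```python
# Cohort sizes and overlap as reported in the text.
cohort_sizes = {"Pilot": 20, "Phase 1": 29, "Phase 2": 30}
overlap_pilot_phase1 = 13  # subjects who took part in both Pilot and Phase 1

enrollments = sum(cohort_sizes.values())                  # phase-level participations
unique_participants = enrollments - overlap_pilot_phase1  # each person counted once
new_in_phase1 = cohort_sizes["Phase 1"] - overlap_pilot_phase1

# Gender shares over unique participants, rounded as in the text.
women, men = 41, 25
share_women = round(100 * women / unique_participants, 1)
share_men = round(100 * men / unique_participants, 1)
share_women_phase2 = round(100 * 22 / cohort_sizes["Phase 2"], 1)

print(enrollments, unique_participants, new_in_phase1)
print(share_women, share_men, share_women_phase2)
```

The same overlap explains why the per\-phase gender counts in Table 2 do not sum to the overall counts: returning Pilot participants appear in two phase columns but only once in the overall column.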

In terms of age distribution, participants had a mean age of $72.6 \pm 4.4$ years, reflecting a homogeneous older population, as further supported by the comparable mean age across phases and their narrow standard deviations\. A similar pattern of comparability was observed for the remaining characteristics\. Statistical testing using Welch’s t\-test did not reveal significant differences between Phase 1 and Phase 2, nor between Phase 2 and the Pilot phase\.
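As an illustration of the comparison described above, Welch's statistic can be computed directly from summary statistics. The sketch below uses the rounded age values from Table 2 for Phase 1 and Phase 2, reading the ± values as standard deviations; it is a rough check, not a reproduction of the authors' analysis, which presumably used the raw data. The function name is illustrative.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups are not assumed to share a variance.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Age (years): Phase 1 = 72.5 +/- 4.5 (n = 29), Phase 2 = 72.3 +/- 4.3 (n = 30).
t_age, df_age = welch_t(72.5, 4.5, 29, 72.3, 4.3, 30)
print(round(t_age, 2), round(df_age, 1))
```

With these inputs the statistic is well below 1 in absolute value, far from the two\-sided 5% critical value of roughly 2 at this number of degrees of freedom, which agrees with the reported absence of a significant age difference.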

Participants were generally in the upper range of normal weight or slightly overweight according to standard BMI categories \[[27](https://arxiv.org/html/2606.00345#bib.bib3)\]\. Mean systolic and diastolic blood pressures were $148.7 \pm 21.8$ mmHg and $79.9 \pm 10.9$ mmHg, respectively, consistent with values commonly observed in older populations, where hypertension is prevalent \[[6](https://arxiv.org/html/2606.00345#bib.bib1)\]\. Some cases could also be associated with white coat hypertension due to the presence of a healthcare professional\. Finally, peripheral capillary oxygen saturation \(SpO2\) remained high and stable across phases \($96.7 \pm 1.3\%$\), suggesting healthy respiratory function at baseline\.

### 3\.2 Sensing ecosystem

To enable continuous and unobtrusive monitoring of participants in real\-world conditions, we adopted a sensing ecosystem based on commercial off\-the\-shelf devices, thus ensuring ecological validity, measurement reliability, and scalable data management\. The participants were equipped with a *Garmin Venu Sq 2* smartwatch to continuously monitor physiological and activity\-related data\. It integrates optical photoplethysmography \(PPG\) sensors to estimate HR, SpO2, *stress levels* derived from heart rate variability \(HRV\), as well as *breathing rate* \(BR\)\. In addition, it includes an accelerometer to capture activity\-related measures such as *step count* and *activity intensity*\.

Sleep\-related metrics were collected using a *Withings Sleep Analyzer*, an unobtrusive under\-mattress device designed for long\-term monitoring in home environments\. It features a pneumatic sensor that measures HR, BR, and body movements via ballistocardiography \(BCG\), as well as a sound sensor that detects audio signals associated with breathing interruptions\. The device provides a detailed sleep summary, including \(but not limited to\) *sleep stages* \(i\.e\., awake, light, deep, REM\), number of awakenings and out\-of\-bed events, physiological parameters, as well as *snoring* measures\. Importantly, it also estimates the *Apnea–Hypopnea Index* \(AHI\), a clinically validated, medical\-grade indicator representing the average number of apneas \(complete breathing pauses\) and hypopneas \(shallow breathing events\) per hour of sleep, which is particularly valuable for assessing sleep\-related breathing disturbances\. Body weight and composition were periodically assessed using a *Garmin Index S2* smart scale, with a recommended measurement frequency of at least 3 times per week\. In addition to *body weight*, the scale performs bioelectrical impedance analysis \(BIA\) to estimate body composition parameters, including *fat mass*, *muscle mass*, *bone mass*, and *total body water*\. It is worth noting that the clinical measurements of body composition rely on a certified medical device and are therefore more accurate, whereas the commercial smart scale allows for repeated measurements at home\. Overall, all these devices offer validated sensing capabilities widely adopted in consumer health ecosystems, ensuring stable and reliable measurements in real\-world conditions\. Moreover, they leverage mature cloud infrastructures for data synchronization, storage, and secure access, which greatly facilitates large\-scale longitudinal data collection and remote monitoring\.
By integrating with the vendors’ official cloud architectures, we ensured continuous data availability while minimizing participant burden and technical maintenance\. All data were automatically synchronized with participants’ smartphones via the Garmin Connect and Withings HealthMate companion apps and uploaded to the vendors’ cloud infrastructures for secure retrieval and analysis\. The sensing data collection procedure is detailed in the next subsection\. Notably, once installed, the sleep tracker and smart scale operate passively by sending their data via the user’s home WiFi network, reducing participant burden and promoting adherence\. In contrast, the smartwatch requires minimal active user interaction, mainly for periodic recharging\.

In addition to the devices described above, we deployed a custom\-developed native mobile application \(for both Android and iOS\) for active and passive data collection\. The former consisted of daily *Ecological Momentary Assessments* \(EMAs\) covering *physical activity*, *social interactions*, *mood*, and *dietary habits*\. These self\-reported data are primarily intended to provide fine\-grained, day\-level ground truth for downstream analyses, but they can also serve as additional contextual features\. For each participant, we scheduled a daily reminder notification after dinner, according to individual preferences, to prompt EMA completion\. The app also provided a summary dashboard displaying recent data from each device, enhancing transparency and participant engagement\. Passive sensing, on the other hand, includes periodic *human activity recognition* \(HAR\) via on\-device physical and virtual sensors, and *location* tracking, in order to complement physiological and behavioral data\. Both features rely on Google/Apple APIs, with daily readings transmitted at the end of each day through a background task scheduled within the mobile app\.

The sensing ecosystem was deployed before each pilot phase\. Researchers installed and configured the required mobile applications on participants’ personal smartphones, and provided the wearable devices\. The installation and configuration of the remaining monitoring devices required a dedicated deployment at participants’ homes, as these devices need access to the domestic WiFi network\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x2.png)

Figure 2: Data collection architecture

Figure [2](https://arxiv.org/html/2606.00345#S3.F2) illustrates the overall data collection architecture adopted in our study\. The infrastructure was specifically designed to \(i\) maintain a clear separation between device\-specific vendor ecosystems and the research backend, \(ii\) guarantee secure data transmission, and \(iii\) enable daily automated integration of heterogeneous data streams into a unified research dataset\. For each participant, we created a unique, fully anonymized account to serve as the sole identifier across the entire infrastructure, without any reference to the personal data of the associated user\. These IDs were used to set up dedicated accounts on the Garmin \([https://connect\.garmin\.com](https://connect.garmin.com/)\) and Withings \([https://app\.withings\.com](https://app.withings.com/)\) cloud platforms and to initialize the corresponding devices\. The same identifier was embedded in our custom mobile application to ensure consistent cross\-platform data linkage\. This approach embeds participant privacy into the system design itself\.

From a data communication perspective, the Garmin smartwatch and smart scale automatically sync with the Garmin cloud, while the sleep tracker uploads data to the Withings cloud upon detecting wake\-up\. These daily synchronizations leverage vendor\-managed infrastructures for secure transmission, storage, and availability, providing industrial\-grade security, reliable service, and robust authentication while minimizing operational and maintenance demands on the research system\. To bridge the gap between vendor cloud platforms and our research repository, we implemented an automated server\-side component, referred to as the *Data Downloader* in Figure [2](https://arxiv.org/html/2606.00345#S3.F2)\. A daily scheduled task programmatically authenticates to the Garmin and Withings cloud services using the anonymized user accounts to retrieve newly available records\. In parallel with vendor\-mediated data flows, the custom smartphone application used a dedicated, fully controlled transmission pathway over HTTPS, ensuring encrypted end\-to\-end transfer of sensitive information\.

## 4 Dataset validation

Figure [3](https://arxiv.org/html/2606.00345#S4.F3) provides an overview of the daily data collected throughout the study\. The colored bars indicate the percentage of available data from each of the six sensing modalities, namely smartwatch \(SW\), sleep tracker \(ST\), scale \(BIA\), EMA, and smartphone\-based HAR and location data \(LOC\)\. The three consecutive monitoring trials lasted 170, 173, and 267 days, respectively\. This preliminary assessment of daily availability assumes that each user contributes at least one sample per modality and is expressed relative to the number of active participants in each trial\. In the ideal case, when all participants provide data for all modalities, the bars sum to 1\.0\. It is important to note that only a small number of dropouts occurred over the entire study, due to personal reasons or sudden health issues\. Specifically, only 4 subjects from Phase 1 were excluded, resulting in a final cohort of 25 participants for this trial \(6% dropout over the entire pilot study\)\.

As we can observe, data availability is highest during APA periods across all trials, with daily percentages consistently above 75% and all sensing modalities represented in most cases\. This was expected, since APA also served as the incentive to increase users’ engagement\. However, data from several sources remain available during non\-APA periods as well, with the exception of the gap between Phase 1 and Phase 2, which is due to the deployment transition between two different user groups\. In addition, a long tail of data is observed after December 2025, resulting from delays in retrieving the sensing devices, which led to continued data collection from a subset of volunteer users\. This final phase was excluded from the evaluation process\.

In terms of study phases, we can note a decrease in the central part of Phase 2, which is related to the summer break between the 2 internal APA periods, during which users spent some time on holiday and the hot climate contributed to a slight reduction in engagement in physical activity\. It is worth specifying that participants were instructed to use their devices continuously throughout the APA period, while they were free to use the devices for the rest of the time\. Results highlight that most of the users maintained their engagement outside the APA periods as well, providing important data for the long\-term analysis\.

To this aim, we compared data availability over the full trial with that of the APA periods only, which span 50 days for the Pilot, 162 days for Phase 1, and 179 days for Phase 2\. Specifically, we analysed device\-specific data by defining the following custom criteria as validity thresholds\. For smartwatch use, we defined an assessment window from 8:00 A\.M\. to 8:00 P\.M\. and considered valid those days with at least 75% of the expected HR measurements within the window, based on the HR sampling rate, which corresponds to a minimum wearing time of 8 hours\.
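
The smartwatch validity rule can be illustrated with a short sketch\. It is not the code used in the study: the function name, the `expected_samples` parameter, and the one\-reading\-per\-minute rate used in the example are assumptions, since the HR sampling rate is not stated here\.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(8, 0)   # 8:00 A.M., assessment window from the text
WINDOW_END = time(20, 0)    # 8:00 P.M., assessment window from the text
MIN_FRACTION = 0.75         # share of expected HR readings required


def is_valid_smartwatch_day(hr_timestamps, expected_samples):
    """Return True if enough HR readings fall inside the assessment window.

    hr_timestamps: datetimes of the HR readings recorded on one day.
    expected_samples: readings expected in the window at the nominal HR
        sampling rate (hypothetical parameter, not given in the text).
    """
    in_window = sum(
        1 for ts in hr_timestamps if WINDOW_START <= ts.time() < WINDOW_END
    )
    return in_window >= MIN_FRACTION * expected_samples


# Example with an assumed rate of one reading per minute (720 expected).
start = datetime(2025, 1, 1, 8, 0)
readings = [start + timedelta(minutes=i) for i in range(600)]
print(is_valid_smartwatch_day(readings, expected_samples=720))
```

Keeping the expected count as a parameter lets the same rule be applied to devices with different sampling rates\.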

For scale data, a day was considered valid if both weight and body composition parameters were recorded\. In fact, in some cases we found only partial measurements due to incorrect use of the scale \(e\.g\., wearing socks or shoes, or standing on the scale for too short a time\)\. When multiple scale measurements were recorded in a day, we used their average values\. For sleep tracker data and EMA, only complete data transmissions are allowed by design\. Therefore, any data available for a given day are intrinsically valid\.

Figure [4](https://arxiv.org/html/2606.00345#S4.F4) reports the distribution of daily validity rates across sensing modalities and trials\. Each value represents the proportion of valid daily samples aggregated across users within a given modality, visualized via violin and box plots for both APA and full study observation windows\. In general, as expected, the APA periods consistently represent the most reliable observation window across sensing modalities, yielding higher median validity and reduced dispersion, thereby reflecting optimal conditions in which participants are actively engaged and device usage is most controlled\. This is particularly evident for smartwatch and smartphone\-based sensing, where APA boundaries capture stable high\-validity regimes across trials\. However, while a degradation in data quality naturally occurs when extending the analysis outside the incentive phase, most data remain usable\. The main cases of reduced validity concern EMAs and BIA, mainly due to the opportunistic use of the related devices\. On this basis, we can support the use of full\-period data for longitudinal analysis, albeit with increased missingness due to the less controlled nature of real\-world data collection\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x3.png)

Figure 3: Dataset overview throughout the study duration\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x4.png) \(a\) Pilot
![Refer to caption](https://arxiv.org/html/2606.00345v1/x5.png) \(b\) Phase 1
![Refer to caption](https://arxiv.org/html/2606.00345v1/x6.png) \(c\) Phase 2

Figure 4: Comparison of data validity distributions between full \(All\) and APA periods across all trials\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x7.png) \(a\) All daily modalities
![Refer to caption](https://arxiv.org/html/2606.00345v1/x8.png) \(b\) All w/o smartphone\-embedded sensing
![Refer to caption](https://arxiv.org/html/2606.00345v1/x9.png) \(c\) Smartwatch and Sleep Tracker

Figure 5: CCDF visualizations of cross\-modal data availability\.

### 4\.1 Cross\-Modal Data availability and temporal alignment

To assess cross\-modal data availability and temporal alignment, we defined a daily completeness indicator for each user and trial, capturing whether all expected sensing modalities provided valid data within the same day\. We then quantified the proportion of days with complete multimodal coverage across all sources\. BIA values were excluded from this indicator from the outset because of the limited number of measurements\.

To characterize the distribution of this measure across participants, we analyzed its complementary cumulative distribution function \(CCDF\), which quantifies the fraction of users achieving at least a given percentage of days with complete cross\-modal data availability within each trial\. This analysis provides a global view of multimodal data coverage, highlighting the extent to which data from different sensing sources are temporally aligned and jointly available for downstream analyses\.
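
As an illustration, the completeness indicator and its CCDF can be sketched as follows\. The function names and the input layout \(one dictionary of per\-modality validity flags per day\) are assumptions made for this example, not the study’s implementation\.

```python
def complete_day_fraction(daily_valid):
    """Share of days on which every modality provided valid data.

    daily_valid: list with one dict per day, mapping modality -> bool.
    """
    complete = sum(1 for day in daily_valid if all(day.values()))
    return complete / len(daily_valid)


def completeness_ccdf(fractions, thresholds):
    """For each threshold t, return the share of users with fraction >= t."""
    n_users = len(fractions)
    return [sum(1 for f in fractions if f >= t) / n_users for t in thresholds]


# Hypothetical per-user fractions of complete days within one trial.
user_fractions = [0.1, 0.5, 0.9, 1.0]
print(completeness_ccdf(user_fractions, thresholds=[0.2, 0.6]))
```

Restricting the dictionaries to a subset of modalities \(for example, smartwatch and sleep tracker only\) yields the corresponding curve for that device combination\.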

Figure [5](https://arxiv.org/html/2606.00345#S4.F5) illustrates the overall cross\-modal data availability and alignment across different combinations of sensing devices and modalities\. Specifically, Figure [5\(a\)](https://arxiv.org/html/2606.00345#S4.F5.sf1) shows cross\-modal data availability when considering all daily sensing modalities, namely smartwatch, sleep tracker, EMAs, and smartphone\-based location and activity data\. The distribution reveals substantial heterogeneity across users\. At relatively low thresholds, most users satisfy the criterion: approximately 90% of users achieve at least 20% of days with complete cross\-modal data availability\. As the threshold increases, the proportion of users progressively decreases\. About 70% of users achieve at least 40% of such days, while this proportion drops to approximately 45–50% at a threshold of 60%\. Overall, these results indicate moderate data completeness when considering the joint availability and temporal alignment of all daily modalities\.

The situation changes if we remove smartphone\-embedded sensing, and even more so when considering only the two main devices: the smartwatch and the sleep tracker\. In the first case, the corresponding CCDFs are shifted upward compared to those obtained when all daily modalities are considered\. The situation further improves in Figure [5\(c\)](https://arxiv.org/html/2606.00345#S4.F5.sf3), indicating high completeness levels between the smartwatch and the sleep tracker\. More than 80% of the users provide complete data on at least 60% of days, and roughly 60–70% of users exceed 80% of days\. The distributions are broadly consistent across the three trials, with only minor variations\. Overall, these results support the combined use of activity and sleep data for reliable full\-day monitoring of the participants\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x10.png) \(a\) FC changes between T0 and T2
![Refer to caption](https://arxiv.org/html/2606.00345v1/x11.png) \(b\) Step count
![Refer to caption](https://arxiv.org/html/2606.00345v1/x12.png) \(c\) AHI

Figure 6: Changes in users’ functional capabilities from baseline to follow\-up assessment \(a\)\. Average weekly trends during Phase 1 for step count \(b\) and AHI \(c\) features, stratified by users’ baseline FC categories \(derived from 6MWT\)\.

## 5 Population Stratification and Multimodal Signal Analysis

Starting from the clinical assessments collected at T0 and T2, we analyzed the impact of the APA program on older adults’ well\-being, both as a structured intervention and as a driver for adopting healthier lifestyles\. To this end, we considered two key dimensions of the clinical evaluation: functional capacity \(FC\), measured through the Six\-Minute Walk Test \(6MWT\) \[[5](https://arxiv.org/html/2606.00345#bib.bib5)\], and Body Mass Index \(BMI\)\. The 6MWT requires each participant to walk for six minutes along a flat, straight path, and the total distance covered \(in meters\) is used to classify subjects into four FC levels: Poor \(distance $<300$ m\), Fair \($300\leq$ distance $\leq 400$ m\), Good \($400<$ distance $\leq 500$ m\), and Very Good \(distance $>500$ m\)\. First, we stratified all participants based on FC, using the evaluations at T0 and T2, and analyzed the evolution of each subject across categories, as illustrated in Figure [6\(a\)](https://arxiv.org/html/2606.00345#S4.F6.sf1)\. The Sankey diagram reveals a predominantly stable trend among participants with Good functional capacity: 45 out of 62 individuals maintained their initial FC level\. In the context of aging populations, such stability represents a clinically meaningful outcome, as preserving functional capacity over time is itself indicative of positive health status\. In addition, 15 participants showed a clear improvement, including 5 subjects who achieved a two\-category improvement, while only 2 individuals experienced a slight decline, mainly due to concurrent health issues\. Overall, these trends suggest a beneficial effect of the APA program, characterized by both the maintenance of functional capacity in the majority of participants and measurable improvements in a substantial subset of the population, with minimal deterioration observed\.
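
The FC categorization and the category shift between two assessments can be written compactly\. The sketch below follows the distance thresholds given above; the function names and the integer encoding of category changes are illustrative assumptions\.

```python
FC_LEVELS = ["Poor", "Fair", "Good", "Very Good"]


def fc_level(distance_m):
    """Map a 6MWT distance in meters to one of the four FC levels."""
    if distance_m < 300:
        return "Poor"
    if distance_m <= 400:
        return "Fair"
    if distance_m <= 500:
        return "Good"
    return "Very Good"


def fc_change(distance_t0, distance_t2):
    """Number of categories gained (positive) or lost (negative) from T0 to T2."""
    return FC_LEVELS.index(fc_level(distance_t2)) - FC_LEVELS.index(fc_level(distance_t0))


# Hypothetical participant walking 290 m at T0 and 450 m at T2.
print(fc_level(290), fc_level(450), fc_change(290, 450))
```

A change of 0 corresponds to a participant who maintained the initial FC level, while a value of 2 corresponds to a two\-category improvement\.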

To further characterize these groups, we analyzed the temporal evolution of behavioral signals in terms of daily steps, and derived sleep\-related outcomes, with particular attention to the Apnea–Hypopnea Index \(AHI\), across FC and BMI strata\. \(AHI is a clinically defined index computed as the average number of apnea and hypopnea events per hour of sleep, generally based on multiple physiological signals collected during polysomnography\.\) Daily step count provides a highly observable and reliable proxy of functional mobility and engagement in physical activity, which is directly measurable from wearable sensors and widely used in both clinical and real\-world monitoring settings\. In contrast, AHI represents a clinically meaningful but less directly observable outcome, reflecting sleep\-related breathing disturbances that are associated with cardiovascular and overall health risks\. By jointly analyzing these two variables, we aim to cover a spectrum of observability, from directly measurable behavioral signals to clinically derived indicators, enabling a more comprehensive characterization of health\-related patterns in the population\.

Temporal trend analyses are reported for Phase 1 only, as this period corresponds to a consistent intervention phase during which participants were regularly engaged in the APA program under stable conditions\. In contrast, Phase 2 is characterized by a non\-stationary structure, including an initial APA period, an intermediate summer break with no structured activity or incentives, and a subsequent reintroduction of the program\. This sequence introduces substantial variability in behavioral patterns, driven by both the temporary interruption of the intervention and seasonal effects, resulting in a non\-negligible decline in activity levels during the break period\. Consequently, including Phase 2 in the trend analysis would confound the interpretation of temporal dynamics, as observed variations would reflect not only individual behavior but also externally induced changes in engagement conditions\. For this reason, we restricted the trend analysis to Phase 1, in order to ensure a more consistent and interpretable evaluation of the relationship between multimodal signals and functional capacity under stable intervention conditions \(see Figures [6\(b\)](https://arxiv.org/html/2606.00345#S4.F6.sf2) and [6\(c\)](https://arxiv.org/html/2606.00345#S4.F6.sf3)\)\.

When stratifying by FC, we observe a clear and monotonic separation across all groups\. Higher FC levels are consistently associated with higher step counts, with well\-separated trajectories across the observation period\. A similar pattern emerges for AHI, where participants in higher FC groups exhibit systematically lower values, while lower FC groups show progressively higher AHI levels\. This results in a highly structured and stable separation across groups, with limited overlap and consistent ordering over time\. However, these findings should be interpreted with caution, particularly for AHI\. As a derived clinical index, AHI is influenced by multiple factors beyond functional capacity, including age, BMI, and sex, and is therefore not directly determined by behavioral or functional measures alone\. The strong separation observed across FC groups may thus reflect the fact that FC captures a broader latent health construct, which is correlated with these determinants, rather than indicating a direct causal relationship between functional capacity and sleep apnea severity\.

These observations are further supported by statistical analysis of activity\-related data\. One\-way ANOVA indicated a significant overall effect of FC group on daily step counts, with post\-hoc pairwise comparisons showing significant differences across all group pairs \(all adjusted $p<0.01$\)\. In particular, comparisons involving the Very Good FC group showed large effect sizes, suggesting a marked separation from the other categories, although this result can be affected by the relatively small sample size of the group\. However, when accounting for inter\-subject variability through linear mixed\-effects models \(LMEM\), the strength of these differences is reduced\. Using the lowest group \(Poor FC\) as reference, only the contrast with the highest group remains statistically significant, while the intermediate groups show positive but non\-significant differences\. This suggests that, although activity levels broadly follow FC stratification, the most robust behavioral contrasts are primarily observed between the extreme groups, while adjacent categories exhibit more gradual transitions\. Overall, these results indicate that FC effectively captures differences in activity levels, while also highlighting that the separation between neighboring groups is limited, reflecting a continuous rather than sharply discretized functional spectrum\.

This result is consistent with the concept of observability, as step count represents a highly observable behavioral signal closely aligned with functional capacity\. While this leads to strong overall group differences, only the most distinct FC levels remain robustly separable when accounting for inter\-subject variability\.

In contrast, BMI\-based stratification leads to more heterogeneous and less regular patterns \(Figures[7\(a\)](https://arxiv.org/html/2606.00345#S5.F7.sf1)and[7\(b\)](https://arxiv.org/html/2606.00345#S5.F7.sf2)\)\. While the obesity group consistently exhibits lower activity levels and higher AHI values—consistent with established clinical evidence—the separation between normal and overweight groups is less pronounced, particularly for activity\. These observations are supported by statistical analysis\. Post\-hoc pairwise comparisons using Dunn’s test revealed significant differences across BMI groups for both daily step count and AHI, confirming a progressive reduction in activity with increasing BMI\. However, linear mixed\-effects models did not identify significant group\-specific temporal slopes, indicating that while groups differ in their overall levels, their trajectories over time remain relatively stable\. A similar pattern is observed for AHI, where obese individuals consistently fall within the severe sleep apnea range, whereas normal\-weight and overweight groups are predominantly distributed between mild and moderate levels\. Also in this case, LMEM analysis did not reveal significant temporal interactions, suggesting that AHI behaves as a relatively stable, subject\-specific characteristic over the observation period\. While this is consistent with the known physiological nature of sleep apnea, the observed stability may also reflect the specific composition of the study cohort, and should therefore be interpreted with caution when generalizing to broader populations\. Overall, these results indicate that the apparent strength of relationships in multimodal sensing data depends critically on the choice of stratification variable\. 
FC\-based grouping provides stronger separability and clearer trends, but may partially reflect underlying confounding factors, while BMI\-based grouping offers a more conservative and clinically grounded perspective, better capturing the inherent variability of physiological conditions\.

Building on these observations, we next shift from stratified analysis to individual\-level predictive modeling, where performance directly reflects the observability of each target, allowing us to evaluate how well multimodal signals can support outcome inference without relying on predefined group structures\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x13.png) \(a\) Step count
![Refer to caption](https://arxiv.org/html/2606.00345v1/x14.png) \(b\) AHI

Figure 7: Average weekly trends during Phase 1 stratified by users’ baseline BMI\.

### 5\.1 Predictive modelling evaluation

To complement the stratified analysis, we formulate a set of predictive tasks aimed at quantifying the extent to which multimodal wearable signals can support the inference of health\-related outcomes at the individual level\. Rather than treating each task independently, we adopt a unified perspective based on the notion of *observability*, where target variables differ in how directly they are reflected in wearable sensor data\. We hypothesize that predictive performance is not solely driven by model complexity, but is primarily determined by the degree of alignment between the measured signals and the underlying latent construct\. To this end, we design a set of experiments spanning targets with progressively lower levels of observability, ranging from directly measurable behavioral signals \(i\.e\., daytime activity levels\), to moderately derived outcomes \(i\.e\., sleep\-related variables\), up to clinically defined indicators \(i\.e\., sleep apnea severity proxies\)\.

As a first step, we consider a binary *activity classification* task, namely Activity Levels, where the goal is to forecast, one day ahead, whether a subject exceeds a predefined daily activity threshold\. Given healthy older adults as the reference population, we set a threshold of 6K daily steps to distinguish between active and non\-active subjects\. This task represents the most direct mapping between sensor signals and target labels, as physical activity is explicitly captured by wearable devices, particularly through smartwatch\-derived measurements\. As such, it serves as a high\-observability reference task, providing an upper bound on predictive performance when the target variable is strongly aligned with the available sensing modalities\.

We then consider Sleep Duration as a proxy for sleep quality, defining a binary prediction task based on whether the subject achieves a sufficient amount of sleep\. Unlike activity, sleep duration is not directly measured but inferred from multiple sensor\-derived signals, making it inherently less observable\. While a conventional threshold of 8 hours is typically used, we adopt a threshold of 6 hours to better reflect the characteristics of the older population, where shorter sleep durations are more common \[[7](https://arxiv.org/html/2606.00345#bib.bib35)\]\. This choice had a negligible impact on the class distribution, as using the higher threshold would result in only a 0\.4% difference with respect to the class proportions shown in Figure [8\(b\)](https://arxiv.org/html/2606.00345#S5.F8.sf2)\.

Finally, we consider the task Sleep AHI, in which we evaluate the prediction of *AHI severity*\. According to clinical guidelines, it is categorized into three classes: mild \(AHI $<15$\), moderate \($15\leq\mathrm{AHI}\leq 30$\), and severe \(AHI $>30$\)\. This task introduces two key challenges\. First, there is a weak coupling between wearable\-derived signals and the underlying clinical condition, as sleep apnea is only indirectly reflected in the available measurements\. Second, the discretization of a continuous physiological variable into fixed thresholds introduces label ambiguity, particularly near class boundaries\. As a result, this task operates in a low\-observability regime, where prediction is fundamentally constrained by limited signal\-to\-label alignment\.
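
The three targets can be expressed as simple labelling functions, sketched below using the thresholds stated above\. The function names, the handling of values exactly at the binary thresholds, and the `next_day_pairs` helper are assumptions made for illustration\.

```python
STEP_THRESHOLD = 6000     # daily steps (Activity Levels)
SLEEP_THRESHOLD_H = 6.0   # hours of sleep (Sleep Duration)


def activity_label(daily_steps):
    """1 = active, 0 = non-active; 'exceeds' is read as strictly greater (assumption)."""
    return 1 if daily_steps > STEP_THRESHOLD else 0


def sleep_label(sleep_hours):
    """1 = sufficient sleep; treating exactly 6 h as sufficient is an assumption."""
    return 1 if sleep_hours >= SLEEP_THRESHOLD_H else 0


def ahi_severity(ahi):
    """Three-class AHI severity: mild (<15), moderate (15 to 30), severe (>30)."""
    if ahi < 15:
        return "mild"
    if ahi <= 30:
        return "moderate"
    return "severe"


def next_day_pairs(days, label_fn, key):
    """Pair the record of day d with the label of day d+1 (one-day-ahead target)."""
    return [(days[i], label_fn(days[i + 1][key])) for i in range(len(days) - 1)]


# Hypothetical three-day history for one subject.
history = [{"steps": 7000}, {"steps": 3000}, {"steps": 6500}]
print([label for _, label in next_day_pairs(history, activity_label, "steps")])
```

The AHI function also makes the label ambiguity concrete: values of 29\.9 and 30\.1 are physiologically close but fall into different classes\.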

Across all tasks, we employ a consistent modeling and evaluation pipeline in order to isolate the effect of task observability from methodological variability\. This allows us to interpret performance differences as a function of signal\-to\-label alignment, rather than dataset\- or model\-specific artifacts\. Overall, our experimental design is intended to characterize a gradient of predictability across different constructs, ranging from strongly observable behavioral signals to weakly identifiable clinical abstractions, in order to better understand both success and failure cases\.

Table 3: Summary of feature types used in the machine learning experiments\.

| Data Type | Feature | Dimension |
| --- | --- | --- |
| Steps | Daytime stats | 13 × 1 |
| HR | Daytime stats | 9 × 1 |
| RR | Daytime stats | 8 × 1 |
| Stress | Daytime stats | 8 × 1 |
| SpO2 | Daytime stats | 8 × 1 |
| Sleep | Sleep summary \+ derived feature | 41 × 1 |
| Lagged | Values over previous 3 days | 3 × feature |
| Rolling | Window stats \(3 and 7 days\) | 24 × window × feature |
| EWM | Window stats \(3 and 7 days\) | 3 × window × feature |
| Static | Age, Gender, BMI | 3 × 1 |
| Contextual | Calendar\-derived | 14 × 1 |

#### 5\.1\.1 Feature engineering and extraction

The Activity Levels task focuses on predicting the next\-day physical activity level using heterogeneous information derived from sleep and behavioral data\. To this end, we leverage *sleep summary* features extracted from the previous night, together with smartwatch\-derived *step count* and *activity statistics* from the previous day\. These are complemented by static *demographic variables* \(i\.e\., age, gender, BMI\) and *calendar\-based* contextual features, such as the day of the week and holidays\.

Moreover, to capture temporal dependencies, we construct a set of *historical features* for each variable by aggregating past observations over multiple time scales\. Specifically, we consider: \(i\) lagged features, representing daily values over the previous $N=3$ days; \(ii\) rolling statistics, computed over fixed windows of 3 and 7 days to capture short\- and medium\-term trends; and \(iii\) exponentially weighted moving \(EWM\) statistics, computed over the entire subject history to emphasize recent observations while retaining long\-term context\. We adopt two EWM configurations with half\-life parameters set to 3 and 7 days, analogous to the rolling windows, to model different temporal memory scales\.
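
A minimal sketch of the three families of historical features is given below, restricted to the mean for brevity\. The function names are illustrative, and the EWM weighting shown \(weight 1 for the most recent day, halving every `halflife` days, normalized by the sum of weights\) is one common formulation assumed here rather than a detail confirmed by the text\.

```python
def lagged(values, n_lags=3):
    """Daily values of the previous n_lags days, most recent first."""
    return values[-n_lags:][::-1]


def rolling_mean(values, window):
    """Mean over the last `window` days of the history."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def ewm_mean(values, halflife):
    """Exponentially weighted mean over the entire history.

    The most recent value has weight 1 and the weight of an observation
    halves every `halflife` days into the past.
    """
    n = len(values)
    weights = [0.5 ** ((n - 1 - i) / halflife) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical step-count history, oldest first, ending on the previous day.
steps = [1000, 2000, 3000, 4000]
features = {
    "lags": lagged(steps, n_lags=3),
    "roll3_mean": rolling_mean(steps, window=3),
    "ewm3_mean": ewm_mean(steps, halflife=3),
    "ewm7_mean": ewm_mean(steps, halflife=7),
}
print(features)
```

With a longer half\-life the EWM mean moves closer to the plain average of the whole history, which is what lets the two configurations represent different memory scales\.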

The Sleep Duration and Sleep AHI tasks follow a consistent feature construction pipeline, enabling a direct comparison across different levels of observability\. In these settings, we use multimodal smartwatch\-derived signals from the previous day, including step count, activity levels, HR, BR, stress, and SpO2, together with *sleep\-related* information from the previous night\. Static and contextual predictors, as well as the same set of historical features, are also included\.

Across all tasks, we restrict smartwatch data to daytime measurements \(between 8 A\.M\. and 8 P\.M\.\) to exclude samples potentially associated with sleep periods\. From these data, we compute a set of descriptive statistics \(e\.g\., minimum, maximum, mean, median, and coverage\), along with modality\-specific features such as resting heart rate and the proportion of time spent in sedentary, active, and highly active states\. For sleep data, we focus on nocturnal events within the complementary time window and include all aggregated metrics provided by sleep summaries\. We further derive additional features capturing sleep structure and quality, including *sleep stage ratios*, *timing indicators* \(e\.g\., midpoint hour\), and fragmentation measures\.

For rolling windows, we compute a comprehensive set of 24 statistical descriptors\. However, not all of these are suitable for EWM representations\. Since EWM assigns exponentially decaying weights to past observations, order\-based or boundary\-dependent statistics \(e\.g\., minimum, maximum, and coverage\) become less meaningful or ill\-defined\. Therefore, for EWM features we adopt a reduced yet informative set of statistics, including the mean, standard deviation, and the average absolute difference between consecutive observations\. A summary of all feature types is reported in Table [3](https://arxiv.org/html/2606.00345#S5.T3)\.

Table 4: Dataset statistics before and after preprocessing\.

| Task | Subjects \(before\) | Observations \(before\) | Features \(before\) | Subjects \(after\) | Observations \(after\) | Features \(after\) |
| --- | --- | --- | --- | --- | --- | --- |
| Activity Levels | 64 | 10241 | 2731 | 53 | 5798 | 101 |
| Sleep Duration | 64 | 9292 | 2253 | 51 | 4723 | 102 |
| Sleep AHI | 63 | 8015 | 2253 | 47 | 4382 | 139 |

Table 5: Hyperparameter search space and total number of configurations for each model\.

| Model | Hyperparameter | Search Space | \# Configs |
| --- | --- | --- | --- |
| LR | C | \[0\.01, 0\.1, 1\.0, 10\.0\] | 8 |
| | penalty | \[l2\] | |
| | solver | \[lbfgs, saga\] | |
| RF | n\_estimators | \[100, 300\] | 48 |
| | max\_depth | \[None, 10, 20\] | |
| | min\_samples\_split | \[2, 5\] | |
| | min\_samples\_leaf | \[1, 2\] | |
| | max\_features | \[sqrt, log2\] | |
| MLP | hidden\_layer\_sizes | \[\(64\), \(128\), \(64,32\)\] | 24 |
| | activation | \[relu, tanh\] | |
| | alpha | \[1e\-4, 1e\-3\] | |
| | learning\_rate\_init | \[1e\-3, 1e\-2\] | |

![Refer to caption](https://arxiv.org/html/2606.00345v1/x15.png) \(a\) Activity Levels
![Refer to caption](https://arxiv.org/html/2606.00345v1/x16.png) \(b\) Sleep duration
![Refer to caption](https://arxiv.org/html/2606.00345v1/x17.png) \(c\) Sleep AHI

Figure 8: Global class distributions across predictive tasks\.

#### 5\.1\.2 AI pipeline

Following feature extraction, we design a structured AI pipeline to enable a robust evaluation of cross\-subject generalization across the three prediction tasks\. As a first step, we apply a conservative preprocessing strategy to ensure that models are trained exclusively on reliable observations\. Specifically, we remove features with a high proportion of missing values \(\>30%\), and subsequently discard samples containing missing entries\. While this choice reduces the overall dataset size, it avoids introducing additional assumptions related to data imputation, which is beyond the scope of this work\. This approach ensures that performance differences across tasks can be attributed to their intrinsic observability, rather than to preprocessing artifacts\.
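
The two-step filtering described above (drop sparse features first, then drop incomplete samples) can be sketched as follows. The row representation, the feature names, and the function name `drop_sparse_then_incomplete` are hypothetical; only the 30% threshold and the order of the two steps come from the text.

```python
def drop_sparse_then_incomplete(rows, max_missing=0.30):
    """rows: list of dicts mapping feature name -> value, with None for missing.

    Step 1 keeps features whose missing share is at most `max_missing`.
    Step 2 keeps rows with no missing value in any kept feature.
    """
    features = list(rows[0])
    n = len(rows)
    kept = [f for f in features
            if sum(r[f] is None for r in rows) / n <= max_missing]
    complete = [{f: r[f] for f in kept} for r in rows
                if all(r[f] is not None for f in kept)]
    return kept, complete

# Hypothetical daily feature vectors; None marks a missing value.
rows = [
    {"hr_mean": 71.0, "spo2_mean": None, "steps": 5200},
    {"hr_mean": 69.5, "spo2_mean": 96.0, "steps": 4100},
    {"hr_mean": None, "spo2_mean": None, "steps": 6300},
    {"hr_mean": 73.2, "spo2_mean": None, "steps": 3900},
    {"hr_mean": 70.1, "spo2_mean": 95.0, "steps": 4800},
]
kept, complete = drop_sparse_then_incomplete(rows)
```

Doing the feature filter first matters: dropping the mostly empty `spo2_mean` column before the row filter keeps four of the five rows, whereas filtering rows first would keep only two.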

We then perform correlation\-based feature selection to retain the most informative variables\. Features exhibiting weak absolute correlation with the target are discarded, resulting in a more compact and stable representation that reduces the risk of overfitting\. To further ensure statistical reliability, we retain only subjects with a sufficient number of daily observations \(\>30\)\. Finally, all features are standardized to mitigate inter\-subject variability and improve comparability across individuals\. Dataset statistics, including the number of subjects, observations, and features before and after preprocessing, are reported in Table [4](https://arxiv.org/html/2606.00345#S5.T4)\.
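
A minimal sketch of these three steps is given below. Several details are assumptions rather than facts from the paper: the correlation measure (Pearson), the cut-off `min_abs_corr=0.1`, and the choice to standardize globally rather than per subject. The data are synthetic and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0

def select_and_standardize(columns, target, subjects, min_abs_corr=0.1, min_obs=30):
    """columns: {feature name: list of values}, aligned with target and subjects."""
    # 1) discard features with weak absolute correlation with the target
    kept = {f: v for f, v in columns.items()
            if abs(pearson(v, target)) >= min_abs_corr}
    # 2) keep rows of subjects with more than `min_obs` daily observations
    counts = Counter(subjects)
    rows = [i for i, s in enumerate(subjects) if counts[s] > min_obs]
    # 3) z-score each remaining feature (global statistics, an assumption)
    out = {}
    for f, v in kept.items():
        vals = [v[i] for i in rows]
        mu, sd = mean(vals), pstdev(vals)
        out[f] = [(x - mu) / sd if sd else 0.0 for x in vals]
    return out, [target[i] for i in rows], [subjects[i] for i in rows]

# Synthetic example: subject "A" has 32 days, subject "B" only 4.
subjects = ["A"] * 32 + ["B"] * 4
target = [i % 2 for i in range(36)]
columns = {
    "signal": [2.0 * t + 0.1 * (i % 3) for i, t in enumerate(target)],
    "noise": [(i // 2) % 2 for i in range(36)],  # uncorrelated with target
}
X, y, kept_subjects = select_and_standardize(columns, target, subjects)
```

In a cross-validated setting, the correlation filter and the scaling statistics would normally be estimated on training subjects only to avoid leakage; the sketch ignores this for brevity.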

In our experiments, we evaluate three standard models, namely Logistic Regression \(LR\), Random Forest \(RF\), and Multilayer Perceptron \(MLP\), under a consistent experimental protocol\. Model performance is assessed using a nested cross\-validation framework\. In the outer loop, we adopt Leave\-One\-Subject\-Out \(LOSO\) cross\-validation to obtain an unbiased estimate of generalization performance on unseen individuals, iteratively holding out each subject as the test set\. Within the inner loop, we perform subject\-wise K\-group cross\-validation \(with K=4\) for hyperparameter tuning and model selection via grid search\. The corresponding search spaces are reported in Table [5](https://arxiv.org/html/2606.00345#S5.T5)\.
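
The configuration counts listed with the hyperparameter search spaces (8, 48, and 24) follow from the Cartesian product of the candidate values. The sketch below enumerates the grids with the standard library only; the dictionary layout and the helper name `configurations` are illustrative choices, and the single-layer sizes are written as one-element tuples by assumption.

```python
from itertools import product

SEARCH_SPACES = {
    "LR": {
        "C": [0.01, 0.1, 1.0, 10.0],
        "penalty": ["l2"],
        "solver": ["lbfgs", "saga"],
    },
    "RF": {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
        "max_features": ["sqrt", "log2"],
    },
    "MLP": {
        "hidden_layer_sizes": [(64,), (128,), (64, 32)],
        "activation": ["relu", "tanh"],
        "alpha": [1e-4, 1e-3],
        "learning_rate_init": [1e-3, 1e-2],
    },
}

def configurations(space):
    """Expand a search space into the list of all hyperparameter combinations."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]

n_configs = {model: len(configurations(space))
             for model, space in SEARCH_SPACES.items()}
# n_configs == {"LR": 8, "RF": 48, "MLP": 24}
```

With K=4 inner folds, each outer LOSO fold therefore requires 4 × 8, 4 × 48, and 4 × 24 model fits for LR, RF, and MLP respectively, before the final refit.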

We use the macro F1\-score as the primary optimization metric to account for class imbalance\. Class distributions for each task are reported in Figure [8](https://arxiv.org/html/2606.00345#S5.F8)\. After selecting the optimal hyperparameters, each model is retrained on the full training portion of the outer fold \(all subjects except the held\-out one\) and evaluated on the held\-out subject\. This procedure ensures that the reported performance reflects true subject\-level generalization\. For each model and task, we report the distribution of the macro F1\-score, as well as standard accuracy \(Acc\) and balanced accuracy \(BA\), across the test subjects\.
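
The nested protocol can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: it uses only LR with a reduced grid, places the scaler inside the pipeline so that it is fitted on training subjects only, and runs on synthetic data with six hypothetical subjects. The function name `nested_loso_macro_f1` is invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def nested_loso_macro_f1(X, y, subjects, param_grid, n_inner_splits=4):
    """Return one macro-F1 score per held-out subject."""
    scores = {}
    # Outer loop: each subject is held out once as the test set.
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        search = GridSearchCV(
            make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
            param_grid,
            scoring="f1_macro",                      # optimisation metric
            cv=GroupKFold(n_splits=n_inner_splits),  # subject-wise inner folds
            refit=True,  # retrain the best configuration on all training subjects
        )
        search.fit(X[train_idx], y[train_idx], groups=subjects[train_idx])
        y_pred = search.predict(X[test_idx])
        held_out = int(subjects[test_idx][0])
        scores[held_out] = f1_score(y[test_idx], y_pred, average="macro")
    return scores

# Synthetic stand-in data: 6 hypothetical subjects, 40 daily samples each.
rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(6), 40)
X = rng.normal(size=(240, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=240) > 0).astype(int)
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
scores = nested_loso_macro_f1(X, y, subjects, grid)
```

The returned dictionary holds one score per subject, which is the quantity whose distribution across test subjects is reported in the results.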

![Refer to caption](https://arxiv.org/html/2606.00345v1/x18.png) \(a\) Activity Levels
![Refer to caption](https://arxiv.org/html/2606.00345v1/x19.png) \(b\) Sleep duration
![Refer to caption](https://arxiv.org/html/2606.00345v1/x20.png) \(c\) Sleep AHI

Figure 9: Cross\-subject performance distribution across predictive tasks\.

#### 5\.1\.3 Classification Results

Figure [9](https://arxiv.org/html/2606.00345#S5.F9) shows the classification performance of the considered models across the three target tasks\. For completeness, we also include accuracy, although this metric is less informative when considered in isolation, particularly in the presence of class imbalance\. To provide a meaningful reference, we compare all models against two simple baselines: a *Random Guesser* \(*RG*\), which assigns labels uniformly at random, and a *Majority\-class Classifier* \(*MC*\), which always predicts the most frequent class\. Overall, the results reveal a clear gradient of predictability across tasks, consistent with their level of observability\.
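
To make the two baselines concrete, the sketch below scores a majority-class predictor and a uniform random guesser with balanced accuracy and macro-F1. The 60/25/15 class split is hypothetical and not taken from the dataset; the metric functions are simplified versions that average over the classes present in the true labels.

```python
import random
from collections import Counter

def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = per_class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (0 when a class has no true positives)."""
    f1s = []
    for label in sorted(set(y_true)):
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical three-class labels with a 60/25/15 split.
y_true = [0] * 60 + [1] * 25 + [2] * 15
majority = Counter(y_true).most_common(1)[0][0]
y_mc = [majority] * len(y_true)                   # Majority-class Classifier
rng = random.Random(0)
y_rg = [rng.choice([0, 1, 2]) for _ in y_true]    # Random Guesser (uniform)

mc_scores = (balanced_accuracy(y_true, y_mc), macro_f1(y_true, y_mc))
rg_scores = (balanced_accuracy(y_true, y_rg), macro_f1(y_true, y_rg))
# mc_scores is (1/3, 0.25): recall is 1 for the majority class and 0 elsewhere.
```

The example shows why accuracy alone can mislead: the majority classifier reaches 60% accuracy on these labels while its balanced accuracy stays at chance level for three classes.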

For the Activity Levels task, all models achieve strong and consistent improvements over the baselines, indicating that the target is clearly learnable and well\-aligned with the available sensing modalities\. RF attains the best overall performance, reaching an average BA of 65\.3% \(corresponding to a \+15\.3% improvement over baselines\) and a macro\-F1 of 65\.4%, with gains of \+27\.7% over the majority classifier and \+15\.4% over RG\. MLP follows closely \(BA 65\.0%, macro\-F1 61\.8%\), while LR also shows competitive performance \(BA 59\.4%, macro\-F1 57\.8%\)\. The consistency of these gains across different model families suggests that performance is not driven by a specific architecture, but rather by the strength of the underlying signal\. In terms of accuracy, RF achieves the highest value \(81\.5%\), significantly exceeding MC; importantly, these gains are aligned with improvements in class\-balanced metrics, indicating genuine discriminative capability rather than bias toward the dominant class\. While overall performance is solid, residual variability across subjects \(standard deviation between 11% and 15%\) highlights that predictability is not uniform, leaving room for further improvements in generalization\.

For the Sleep Duration task, all models again outperform both baselines in terms of class\-balanced metrics, although performance is generally lower and more variable, reflecting the reduced observability of the target\. LR achieves the highest balanced accuracy \(BA 60\.4%, macro\-F1 55\.7%\), followed closely by RF \(BA 58\.2%, macro\-F1 56\.1%\) and MLP \(BA 57\.6%, macro\-F1 53\.0%\)\. These results indicate that meaningful predictive signals are present, even though the relationship between wearable data and sleep duration is less direct\. In contrast, accuracy is less informative in this setting due to pronounced class imbalance: while RF achieves the highest value \(84\.8%\), improvements over MC are limited, and LR and MLP remain close to it, confirming that accuracy is largely driven by class proportions\. A key characteristic of this task is the substantial cross\-subject variability, with standard deviations ranging from approximately 13% to 22%\. Notably, the distributions are skewed toward higher values, with several subjects exhibiting near\-perfect BA and F1 scores\. This suggests that certain individuals are inherently easier to predict, likely due to stronger signal\-to\-label alignment, while others remain significantly more challenging, contributing to the observed variability\.

Finally, the Sleep AHI task represents the most challenging setting, as it involves the prediction of a clinically defined outcome characterized by weak coupling with wearable\-derived signals\. Despite this, all models substantially outperform both baselines on the most informative metrics\. In particular, BA exceeds 48% for all models, with RF achieving the best performance \(BA 51\.8%\), corresponding to a \+18\.5% improvement over RG\. Similarly, macro\-F1 increases from 33\.3% \(RG\) and 24\.4% \(MC\) to approximately 50% for LR and RF, confirming that the models capture meaningful patterns beyond trivial baselines\. Accuracy comparisons are again less straightforward due to class imbalance; nevertheless, all models surpass the majority classifier \(Acc 57\.7%\), with RF reaching 73\.8% and LR 71\.6%, while MLP achieves lower performance \(63\.3%\)\. Interestingly, LR performs competitively with RF in terms of macro\-F1, suggesting that even linear models can capture relevant structure in this task\. However, performance variability across subjects is particularly high \(standard deviation ≈ 18–23%\), indicating that the task remains inherently difficult\. As in the previous task, some subjects exhibit near\-perfect performance, suggesting that prediction is highly dependent on individual\-specific factors such as data quality, behavioral consistency, or class separability\.

In conclusion, all the results demonstrate clear improvements over baselines and confirm the presence of exploitable signals\. However, achieving robust and consistent performance across individuals remains challenging, which can be partly attributed to the high temporal stability of the underlying data\. The limited intra\-subject variability reduces the availability of discriminative patterns, making it more difficult for models to generalize reliably across different individuals\.

#### 5\.1\.4 Explainability

![Refer to caption](https://arxiv.org/html/2606.00345v1/x21.png)

Figure 10: Comparative heatmap of the top 15 most important features across LR, RF, and MLP models for the sleep AHI prediction task\.

To better understand the factors driving model predictions and assess the relevance of the input features, we performed an explainability analysis\. For each model, we computed feature attribution scores on the corresponding test set, using the most appropriate method for each model family\. For LR, we directly use the learned coefficients as a measure of feature importance, while for the remaining models, we rely on SHapley Additive exPlanations \(SHAP\) \[[13](https://arxiv.org/html/2606.00345#bib.bib36)\] to estimate feature contributions\. In particular, we use TreeSHAP \[[14](https://arxiv.org/html/2606.00345#bib.bib42)\] to obtain exact Shapley value computations for RF, while KernelSHAP \[[13](https://arxiv.org/html/2606.00345#bib.bib36)\] is employed to approximate feature attributions for the MLP\. For each model, we then computed the mean absolute importance across test subjects, allowing us to rank features and assess their consistency across model families\.

Figure [10](https://arxiv.org/html/2606.00345#S5.F10) presents the resulting feature importance heatmap for the Sleep AHI task, shown as a representative example among the three analyzed settings\. The corresponding visualizations for the Activity Levels and Sleep Duration tasks are shown in Figure [11](https://arxiv.org/html/2606.00345#A1.F11) in the Appendix for completeness\. To enable a direct comparison across models, feature importance scores are first normalized within each model, and then aggregated to compute a global ranking of the top\-15 features\. The heatmap displays these features ordered by decreasing global importance, highlighting both shared and model\-specific patterns\. As we can note, the highest agreement is observed between LR and RF, which share 11 out of the top\-15 features \(73\.3%\)\. In contrast, the MLP exhibits substantially lower agreement, with only 2 features shared with either LR or RF \(13\.3%\)\. This discrepancy may partially stem from the approximate nature of KernelSHAP, which can introduce additional variability in feature attribution for non\-linear models\.
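
The normalize-then-aggregate step can be sketched as below. The text does not say which normalization or aggregation was used, so this example assumes sum-to-one scaling of absolute importances within each model and a plain average across models. Feature names and scores are invented for illustration.

```python
def normalize(importances):
    """Scale absolute importances so they sum to 1 within one model."""
    total = sum(abs(v) for v in importances.values())
    return {f: abs(v) / total for f, v in importances.items()}

def global_top_k(per_model, k):
    """Average normalized importances across models and return the k best features."""
    normalized = {m: normalize(imp) for m, imp in per_model.items()}
    features = set().union(*(imp.keys() for imp in normalized.values()))
    avg = {f: sum(n.get(f, 0.0) for n in normalized.values()) / len(normalized)
           for f in features}
    # Sort by decreasing average importance, breaking ties by name.
    return sorted(avg, key=lambda f: (-avg[f], f))[:k]

# Hypothetical raw scores: LR coefficients and mean |SHAP| values for RF and MLP.
per_model = {
    "LR": {"steps_mean_7d": -2.0, "hr_rest_7d": 1.0, "sleep_eff_1d": 1.0},
    "RF": {"steps_mean_7d": 0.30, "hr_rest_7d": 0.15, "sleep_eff_1d": 0.05},
    "MLP": {"steps_mean_7d": 0.02, "hr_rest_7d": 0.01, "sleep_eff_1d": 0.07},
}
ranking = global_top_k(per_model, k=2)
# ranking == ["steps_mean_7d", "sleep_eff_1d"]
```

Normalizing within each model first prevents a model with large raw scores (here the LR coefficients) from dominating the aggregated ranking.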

A similar analysis across the other tasks reveals a consistent pattern, although with varying degrees of agreement\. For the Sleep Duration task, LR and RF still show the strongest overlap, albeit reduced to 6 shared features \(40\.0%\), while the MLP again exhibits limited agreement \(between 6\.7% and 13\.3%\)\. In contrast, for the Activity Levels task, the agreement is more evenly distributed across all model pairs, with 7 shared features between RF and MLP \(46\.7%\), 5 between LR and RF \(33\.3%\), and 4 between LR and MLP \(26\.7%\)\. This more balanced overlap suggests the absence of a dominant pairwise agreement, and instead indicates a broader consensus across models, which tend to focus on similar sets of informative features\.

To provide a more systematic comparison, Table [6](https://arxiv.org/html/2606.00345#S5.T6) reports the *Jaccard Index* for each model pair across tasks, quantifying the overlap between feature sets as the ratio between their intersection and union\. The table confirms the trends observed in the heatmaps, highlighting substantial variability in agreement depending on the task\. In particular, the Sleep AHI task exhibits a strong alignment between LR and RF \(57\.9%\), while agreement with MLP remains minimal\. The Sleep Duration task shows moderate agreement \(25\.0%\) limited to LR and RF, whereas all other pairs exhibit very low similarity\. Finally, the Activity Levels task is characterized by a more distributed overlap, with no clearly dominant pair, reflecting a more consistent cross\-model consensus\. Overall, these results suggest that agreement across models is higher in tasks where the underlying signal is more structured or strongly aligned with the feature space, while lower agreement may indicate either weaker observability or increased model sensitivity to different aspects of the data\.
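
For two top-15 lists that share s features, the union contains 30 − s features, so the Jaccard Index equals s / (30 − s). The sketch below checks this against the shared-feature counts quoted in the text; the feature identifiers are placeholders and the helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard Index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_from_shared(shared, k=15):
    """Two top-k sets sharing `shared` features have a union of size 2k - shared."""
    return shared / (2 * k - shared)

# Placeholder top-15 lists sharing 11 features, mirroring the LR and RF
# overlap reported for the Sleep AHI task.
lr_top = [f"f{i}" for i in range(15)]      # f0 .. f14
rf_top = [f"f{i}" for i in range(4, 19)]   # f4 .. f18, 11 in common with lr_top

sleep_ahi_lr_rf = round(100 * jaccard(lr_top, rf_top), 1)       # 57.9
sleep_duration_lr_rf = round(100 * jaccard_from_shared(6), 1)   # 25.0
activity_rf_mlp = round(100 * jaccard_from_shared(7), 1)        # 30.4
activity_lr_rf = round(100 * jaccard_from_shared(5), 1)         # 20.0
activity_lr_mlp = round(100 * jaccard_from_shared(4), 1)        # 15.4
```

This explains why the Jaccard values are lower than the raw overlap percentages: 11 shared features out of 15 is 73.3% of each list but only 57.9% of the 19 distinct features in the union.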

Another key aspect emerging from this analysis is that the top\-ranked features are consistently historical in nature, particularly those derived from medium\- to long\-term temporal windows \(e\.g\., one week\)\. This indicates that the learning process is primarily driven by temporally aggregated predictors, which capture behavioral regularities unfolding over extended periods\. Notably, this trend is consistent across all tasks, suggesting that the underlying phenomena exhibit strong temporal dependencies that cannot be fully captured by short\-term observations alone\.

These findings further emphasize the importance of longitudinal monitoring, as it enables both improved predictive performance and the extraction of stable and informative patterns that would otherwise remain hidden in short\-term data\.

Table 6: Jaccard Index \(%\) between model pairs across tasks\.

| Task | LR–RF | RF–MLP | MLP–LR |
| --- | --- | --- | --- |
| Activity Levels | 20\.0 | 30\.4 | 15\.4 |
| Sleep Duration | 25\.0 | 7\.1 | 7\.1 |
| Sleep AHI | 57\.9 | 3\.5 | 7\.1 |

## 6 Conclusions and Future work

In this work, we presented a longitudinal multimodal study of older adults conducted in real\-world conditions, combining wearable sensing, mobile data collection, and periodic clinical assessments over an extended observation period\. The resulting dataset provides a rich representation of daily life collected in the wild, capturing both short\-term behavioral variability and long\-term health\-related trajectories\. In this way, the study helps bridge the gap between controlled experimental settings and real\-world health monitoring\.

Beyond the dataset itself, we proposed a unified analysis framework based on the concept of observability, investigating how the alignment between wearable\-derived signals and target variables affects predictive performance\. Through a set of machine learning experiments that span tasks with increasing levels of abstraction, ranging from activity prediction to clinically defined sleep outcomes, we showed that predictive performance is strongly influenced by the intrinsic observability of the target\. Although highly observable behavioral signals can be reliably inferred, more abstract or clinically grounded constructs remain challenging, despite the presence of meaningful predictive patterns\. These findings provide a principled lens for interpreting both the potential and limitations of wearable\-based inference, particularly in longitudinal settings\.

Importantly, this work specifically targets an older population, which remains underrepresented in existing wearable datasets and modeling efforts\. Our results highlight both the feasibility and the challenges of predictive modeling in this context, where inter\-subject variability, heterogeneous behavioral patterns, and complex physiological dynamics require careful modeling choices and evaluation protocols\. In this sense, our study contributes not only a dataset, but also methodological insights that are directly relevant for the design of predictive systems tailored to aging populations\.

Looking ahead, several research directions emerge from this work\. First, while our experiments rely on classical machine learning models with engineered features, future work could explore more advanced time\-series modeling approaches, including *deep learning architectures* specifically designed for sequential data, such as *recurrent*, *convolutional*, or *transformer\-based* models \[[1](https://arxiv.org/html/2606.00345#bib.bib34)\]\. These methods could better capture temporal dependencies and long\-range patterns that are only partially represented through lagged and rolling features\. At the same time, hybrid approaches combining feature\-based representations with temporal models may provide a promising trade\-off between interpretability and predictive power\. Second, the longitudinal nature of the dataset opens the door to alternative modeling paradigms beyond per\-sample prediction\. In particular, *feature\-level longitudinal modeling* and *subject\-centric representations* could be leveraged to better capture individual behavioral trajectories, enabling personalized modeling strategies and improving cross\-subject generalization \[[18](https://arxiv.org/html/2606.00345#bib.bib33)\]\. This is especially relevant in older populations, where variability across individuals is often more pronounced and clinically meaningful\. Finally, our dataset and findings contribute to the emerging area of *Wearable Behavioural Foundation Models* \(*WBFMs*\) \[[20](https://arxiv.org/html/2606.00345#bib.bib32),[29](https://arxiv.org/html/2606.00345#bib.bib31)\], which aim to learn general\-purpose representations from large\-scale, multimodal behavioral data\. By providing a longitudinal, multimodal, and clinically anchored dataset collected in the wild, this work offers a valuable resource for pretraining and evaluating such models\. Future research could explore self\-supervised and representation learning approaches to extract transferable embeddings, enabling more robust and scalable inference across tasks and populations\.

In conclusion, this work highlights both the opportunities and the inherent challenges of longitudinal multimodal sensing for health monitoring in older adults\. By combining dataset design, empirical analysis, and predictive modeling, we provide a step toward more realistic, interpretable, and generalizable wearable\-based health inference systems, while outlining a path for future research at the intersection of sensing, machine learning, and digital health\.

## References

- \[1\] G\. Augustinov, M\. A\. Nisar, F\. Li, A\. Tabatabaei, M\. Grzegorzek, K\. Sohrabi, and S\. Fudickar \(2023\) Transformer\-based recognition of activities of daily living from wearable sensor data\. In Proceedings of the 7th International Workshop on Sensor\-Based Activity Recognition and Artificial Intelligence, iWOAR ’22, New York, NY, USA\. External Links: ISBN 9781450396240, [Document](https://dx.doi.org/10.1145/3558884.3558895)\. Cited by: [§6](https://arxiv.org/html/2606.00345#S6.p4.1)\.
- \[2\] O\. Beauchet, B\. Fantino, G\. Allali, S\. Muir, M\. Montero\-Odasso, and C\. Annweiler \(2011\) Timed Up and Go test and risk of falls in older adults: a systematic review\. The journal of nutrition, health & aging 15\(10\), pp\. 933–938\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.8.1)\.
- \[3\] F\. Bombassei De Bona, I\. A\. Câmpanu, M\. Langheinrich, M\. Gjoreski, and G\. Juravle \(2025\-12\) LIFETRACE: a longitudinal multimodal dataset on daily physical activity, well\-being, and habits\. Proc\. ACM Interact\. Mob\. Wearable Ubiquitous Technol\. 9\(4\)\. External Links: [Link](https://doi.org/10.1145/3770676), [Document](https://dx.doi.org/10.1145/3770676)\. Cited by: [§2](https://arxiv.org/html/2606.00345#S2.p3.1)\.
- \[4\] V\. P\. Cornet and R\. J\. Holden \(2018\) Systematic review of smartphone\-based passive sensing for health and wellbeing\. Journal of Biomedical Informatics 77, pp\. 120–132\. External Links: ISSN 1532\-0464, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jbi.2017.12.008), [Link](https://www.sciencedirect.com/science/article/pii/S1532046417302782)\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p1.1)\.
- \[5\] P\. L\. Enright \(2003\) The six\-minute walk test\. Respiratory Care 48\(8\), pp\. 783–785\. External Links: [Document](https://dx.doi.org/10.4187/respcare.03480783)\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.7.1), [§5](https://arxiv.org/html/2606.00345#S5.p1.15)\.
- \[6\] A\. Garg and B\. J\. Messinger\-Rapport \(2018\-03\) Hypertension in older adults: what is the target blood pressure?\. Cleve\. Clin\. J\. Med\. 85\(3\), pp\. 193–195 \(en\)\. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.3949/ccjm.85a.17064)\. Cited by: [§3\.1](https://arxiv.org/html/2606.00345#S3.SS1.p3.4)\.
- \[7\] B\. Ha, M\. Han, W\. So, and S\. Kim \(2024\) Sex differences in the association between sleep duration and frailty in older adults: evidence from the KNHANES study\. BMC geriatrics 24\(1\), pp\. 434\. Cited by: [§5\.1](https://arxiv.org/html/2606.00345#S5.SS1.p3.3)\.
- \[8\] G\. M\. Harari, N\. D\. Lane, R\. Wang, B\. S\. Crosier, A\. T\. Campbell, and S\. D\. Gosling \(2016\) Using smartphones to collect behavioral data in psychological science: opportunities, practical considerations, and challenges\. Perspectives on Psychological Science 11\(6\), pp\. 838–854\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p1.1)\.
- \[9\] J\. D\. Henry and J\. R\. Crawford \(2005\) The short\-form version of the Depression Anxiety Stress Scales \(DASS\-21\): construct validity and normative data in a large non\-clinical sample\. British journal of clinical psychology 44\(2\), pp\. 227–239\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.5.1)\.
- \[10\] International Federation of Adapted Physical Activity \(2020\) What is adapted physical activity?\. External Links: [Link](https://ifapa.net/)\. Cited by: [§3](https://arxiv.org/html/2606.00345#S3.p3.6)\.
- \[11\] P\. H\. Lee, D\. J\. Macfarlane, T\. H\. Lam, and S\. M\. Stewart \(2011\) Validity of the International Physical Activity Questionnaire Short Form \(IPAQ\-SF\): a systematic review\. International journal of behavioral nutrition and physical activity 8\(1\), pp\. 115\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.6.1)\.
- \[12\] N\. Liu, J\. Yin, S\. S\. Tan, K\. Y\. Ngiam, and H\. H\. Teo \(2021\-10\) Mobile health applications for older adults: a systematic review of interface and persuasive feature design\. J Am Med Inform Assoc 28\(11\), pp\. 2483–2501 \(en\)\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p2.1)\.
- \[13\] S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. Advances in neural information processing systems 30\. Cited by: [§5\.1\.4](https://arxiv.org/html/2606.00345#S5.SS1.SSS4.p1.1)\.
- \[14\] S\. M\. Lundberg, G\. G\. Erion, and S\. Lee \(2018\) Consistent individualized feature attribution for tree ensembles\. CoRR abs/1802\.03888\. External Links: [Link](http://arxiv.org/abs/1802.03888), 1802\.03888\. Cited by: [§5\.1\.4](https://arxiv.org/html/2606.00345#S5.SS1.SSS4.p1.1)\.
- \[15\] S\. Mattingly et al\. \(2019\) The Tesserae project: large\-scale, longitudinal, in situ, multimodal sensing of information workers\. In Proceedings of the CHI Conference on Human Factors in Computing Systems\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p2.1), [§2](https://arxiv.org/html/2606.00345#S2.p1.1)\.
- \[16\] T\. Mollayeva, P\. Thurairajah, K\. Burton, S\. Mollayeva, C\. M\. Shapiro, and A\. Colantonio \(2016\) The Pittsburgh Sleep Quality Index as a screening tool for sleep dysfunction in clinical and non\-clinical samples: a systematic review and meta\-analysis\. Sleep medicine reviews 25, pp\. 52–73\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.4.1)\.
- \[17\] K\. Mundnich et al\. \(2020\) TILES\-2018: a longitudinal physiological and behavioral dataset of hospital workers\. arXiv preprint arXiv:2003\.08474\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p2.1), [§2](https://arxiv.org/html/2606.00345#S2.p1.1)\.
- \[18\] J\. Onnela and S\. L\. Rauch \(2016\-06\-01\) Harnessing smartphone\-based digital phenotyping to enhance behavioral and mental health\. Neuropsychopharmacology 41\(7\), pp\. 1691–1696\. External Links: ISSN 1740\-634X, [Document](https://dx.doi.org/10.1038/npp.2016.7), [Link](https://doi.org/10.1038/npp.2016.7)\. Cited by: [§6](https://arxiv.org/html/2606.00345#S6.p4.1)\.
- \[19\] L\. Piwek, D\. A\. Ellis, S\. Andrews, and A\. Joinson \(2016\-02\) The rise of consumer health wearables: promises and barriers\. PLOS Medicine 13\(2\), pp\. 1–9\. External Links: [Document](https://dx.doi.org/10.1371/journal.pmed.1001953), [Link](https://doi.org/10.1371/journal.pmed.1001953)\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p1.1)\.
- \[20\] M\. Qiu, C\. Weng, M\. Fan, and K\. Wu \(2025\-09\) Towards customizable foundation models for human activity recognition with wearable devices\. Proc\. ACM Interact\. Mob\. Wearable Ubiquitous Technol\. 9\(3\)\. External Links: [Link](https://doi.org/10.1145/3749479), [Document](https://dx.doi.org/10.1145/3749479)\. Cited by: [§6](https://arxiv.org/html/2606.00345#S6.p4.1)\.
- \[21\] A\. D\. Rossi, D\. Marzorati, T\. Gerosa, R\. Švihrová, S\. Santini, and F\. Faraci \(2025\-09\) Unobtrusive perceived sleep quality monitoring in the wild\. Proc\. ACM Interact\. Mob\. Wearable Ubiquitous Technol\. 9\(3\)\. External Links: [Link](https://doi.org/10.1145/3749502), [Document](https://dx.doi.org/10.1145/3749502)\. Cited by: [§2](https://arxiv.org/html/2606.00345#S2.p2.1)\.
- \[22\] A\. Saeed, T\. Ozcelebi, and J\. Lukkien \(2019\-06\) Multi\-task self\-supervised learning for human activity detection\. Proc\. ACM Interact\. Mob\. Wearable Ubiquitous Technol\. 3\(2\)\. External Links: [Link](https://doi.org/10.1145/3328932), [Document](https://dx.doi.org/10.1145/3328932)\. Cited by: [§2](https://arxiv.org/html/2606.00345#S2.p2.1)\.
- \[23\] T\. N\. Tombaugh and N\. J\. McIntyre \(1992\) The Mini\-Mental State Examination: a comprehensive review\. Journal of the American Geriatrics Society 40\(9\), pp\. 922–935\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.2.1)\.
- \[24\] Y\. Vaizman, K\. Ellis, and G\. Lanckriet \(2017\) ExtraSensory: a multimodal dataset for context recognition in the wild\. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p2.1), [§2](https://arxiv.org/html/2606.00345#S2.p3.1)\.
- \[25\] B\. Vellas, Y\. Guigoz, P\. J\. Garry, F\. Nourhashemi, D\. Bennahum, S\. Lauque, and J\. Albarede \(1999\) The Mini Nutritional Assessment \(MNA\) and its use in grading the nutritional state of elderly patients\. Nutrition 15\(2\), pp\. 116–122\. Cited by: [Table 1](https://arxiv.org/html/2606.00345#S3.T1.4.1.3.1)\.
- \[26\] R\. Wang, F\. Chen, Z\. Chen, T\. Li, G\. Harari, S\. Tignor, X\. Zhou, D\. Ben\-Zeev, and A\. T\. Campbell \(2014\) StudentLife: assessing mental health, academic performance and behavioral trends of college students using smartphones\. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’14, New York, NY, USA, pp\. 3–14\. External Links: ISBN 9781450329682, [Link](https://doi.org/10.1145/2632048.2632054), [Document](https://dx.doi.org/10.1145/2632048.2632054)\. Cited by: [§2](https://arxiv.org/html/2606.00345#S2.p1.1)\.
- \[27\] J\. E\. Winter, R\. J\. MacInnis, N\. Wattanapenpaiboon, and C\. A\. Nowson \(2014\) BMI and all\-cause mortality in older adults: a meta\-analysis\. Obesity 22\(1\), pp\. –\. External Links: [Document](https://dx.doi.org/10.1002/oby.21612)\. Cited by: [§3\.1](https://arxiv.org/html/2606.00345#S3.SS1.p3.4)\.
- \[28\] World Health Organization \(2020\) WHO guidelines on physical activity and sedentary behaviour\. World Health Organization\. External Links: [Link](https://www.who.int/publications/i/item/9789240015128)\. Cited by: [§3](https://arxiv.org/html/2606.00345#S3.p1.1)\.
- \[29\] Y\. Y\. Wu, Y\. Zhang, H\. Yoon, T\. Dang, D\. Spathis, T\. Xia, Q\. Yang, J\. Han, D\. Ma, S\. Lee, and C\. Mascolo \(2026\) Wearable foundation models should go beyond static encoders\. External Links: 2603\.19564, [Link](https://arxiv.org/abs/2603.19564)\. Cited by: [§6](https://arxiv.org/html/2606.00345#S6.p4.1)\.
- \[30\] X\. Xu, X\. Liu, H\. Zhang, W\. Wang, S\. Nepal, Y\. Sefidgar, W\. Seo, K\. S\. Kuehn, J\. F\. Huckins, M\. E\. Morris, P\. S\. Nurius, E\. A\. Riskin, S\. Patel, T\. Althoff, A\. Campbell, A\. K\. Dey, and J\. Mankoff \(2023\-01\) GLOBEM: cross\-dataset generalization of longitudinal human behavior modeling\. Proc\. ACM Interact\. Mob\. Wearable Ubiquitous Technol\. 6\(4\)\. External Links: [Link](https://doi.org/10.1145/3569485), [Document](https://dx.doi.org/10.1145/3569485)\. Cited by: [§1](https://arxiv.org/html/2606.00345#S1.p1.1), [§1](https://arxiv.org/html/2606.00345#S1.p2.1), [§2](https://arxiv.org/html/2606.00345#S2.p1.1), [§2](https://arxiv.org/html/2606.00345#S2.p2.1)\.
- \[31\] D\. Y\. Zhang, D\. W\. An, Y\. L\. Yu, J\. D\. Melgarejo, J\. Boggia, D\. S\. Martens, T\. W\. Hansen, K\. Asayama, T\. Ohkubo, K\. Stolarz\-Skrzypek, S\. Malyutina, E\. Casiglia, L\. Lind, G\. E\. Maestre, J\. G\. Wang, Y\. Imai, K\. Kawecka\-Jaszcz, E\. Sandoya, M\. Rajzer, T\. S\. Nawrot, E\. O’Brien, W\. Y\. Yang, J\. Filipovský, A\. Graciani, J\. R\. Banegas, Y\. Li, J\. A\. Staessen, and International Database of Ambulatory Blood Pressure in Relation to Cardiovascular Outcomes Investigators \(2025\) Ambulatory blood pressure monitoring, European guideline targets, and cardiovascular outcomes: an individual patient data meta\-analysis\. European Heart Journal 46\(30\), pp\. 2974–2987\. External Links: [Document](https://dx.doi.org/10.1093/eurheartj/ehafXXX)\. Cited by: [§3](https://arxiv.org/html/2606.00345#S3.p5.1)\.

## Appendix

## Appendix A Additional Feature Importance Visualizations

For completeness, we report in this appendix the feature importance heatmaps for the Activity Levels \(Figure [11\(a\)](https://arxiv.org/html/2606.00345#A1.F11.sf1)\) and Sleep Duration \(Figure [11\(b\)](https://arxiv.org/html/2606.00345#A1.F11.sf2)\) tasks, complementing the Sleep AHI analysis presented in the main text\. These visualizations are obtained using the same methodology, where feature importance is estimated for each model, normalized within\-model, and aggregated to derive a global ranking of the top\-15 features\.

Overall, the heatmaps confirm the trends discussed in the main paper\. While the degree of agreement across models varies depending on the task, the most relevant predictors consistently correspond to *historical features*, i\.e\., features that capture information from past observations over extended time horizons\. This pattern is particularly evident in the Activity Levels task, where a relatively higher agreement across models is associated with a shared focus on temporally aggregated information\. Similarly, in the Sleep Duration task, although the overlap across models is more limited, the top\-ranked features still predominantly reflect historical information rather than instantaneous measurements\.

In conclusion, these results reinforce the key finding that longitudinal information plays a central role in model performance across all tasks\. The consistent prominence of historical features suggests that the relevant behavioral and physiological patterns evolve over time and cannot be fully captured by short\-term observations alone\.

![Refer to caption](https://arxiv.org/html/2606.00345v1/x22.png) \(a\) Activity Level
![Refer to caption](https://arxiv.org/html/2606.00345v1/x23.png) \(b\) Sleep Duration

Figure 11: Comparative heatmap of the top 15 most important features across LR, RF, and MLP models for the a\) Activity Levels and b\) Sleep Duration tasks\.
