Optuna Constrained Tree-Structured Parzen Estimator Is a Joint Density Generalization of c-TPE
Summary
This paper demonstrates that Optuna's constrained Tree-Structured Parzen Estimator (TPE) is a joint density generalization of the c-TPE algorithm, showing its invariance to constraint duplication while independent c-TPE degrades. The authors outline practical tradeoffs and directions for future study.
# Optuna Constrained Tree-Structured Parzen Estimator Is a Joint Density Generalization of c-TPE
Source: [https://arxiv.org/html/2606.09889](https://arxiv.org/html/2606.09889)
###### Abstract
Constrained hyperparameter optimization (HPO) is common in practice, yet Optuna's widely used constrained TPE lacks algorithmic analysis. While c-TPE proposes an expected constrained improvement (ECI) approach assuming *independence* between the objective and constraints, Optuna uses a single joint density over both. We show that Optuna's constrained TPE is *joint c-TPE*: the same ECI acquisition function using a joint likelihood. We demonstrate that joint c-TPE is invariant to constraint duplication, whereas independent c-TPE degrades as the product accumulates duplicated factors. We outline practical tradeoffs between the formulations and directions for future study.
## 1 Introduction
The performance of deep learning algorithms is sensitive to hyperparameter (HP) selection, which can be formulated as an optimization problem $\boldsymbol{x}^{\star} \in \mathop{\mathrm{argmin}}_{\boldsymbol{x}} f(\boldsymbol{x})$. Optuna (Akiba et al., [2019](https://arxiv.org/html/2606.09889#bib.bib1)) is a de facto HPO framework whose default algorithm is the tree-structured Parzen estimator (TPE) (Bergstra et al., [2011](https://arxiv.org/html/2606.09889#bib.bib2); Watanabe, [2023](https://arxiv.org/html/2606.09889#bib.bib9)). TPE models the likelihood of HPs given observations $\mathcal{D} = \{(\boldsymbol{x}_n, f_n)\}_{n=1}^{N}$, where $\boldsymbol{x}_n$ is the $n$-th HP, e.g., learning rate, and $f_n$ is the corresponding objective value, e.g., error rate, via a split at the top-$\gamma$ quantile:
$$
p(\boldsymbol{x}|f,\mathcal{D}) =
\begin{cases}
p(\boldsymbol{x}|\mathcal{D}^{(l)}) & (f \leq f^{\gamma}) \\
p(\boldsymbol{x}|\mathcal{D}^{(g)}) & (f > f^{\gamma})
\end{cases}
\tag{1}
$$

where $f^{\gamma}$ is the top-$\gamma$ quantile objective value, $\mathcal{D}^{(l)}$ and $\mathcal{D}^{(g)}$ contain observations below (better) and above (worse) $f^{\gamma}$, and both densities are modeled by Parzen estimators, also known as kernel density estimators (KDEs). Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)) showed that the density ratio is proportional to the probability of improvement: $p(\boldsymbol{x}|\mathcal{D}^{(l)}) / p(\boldsymbol{x}|\mathcal{D}^{(g)}) \propto \mathbb{P}(f \leq f^{\gamma} \mid \boldsymbol{x})$.
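The split in Eq. (1) and the resulting density ratio can be sketched in a few lines of standard-library Python. This is a minimal 1D illustration, not Optuna's implementation: the fixed-bandwidth Gaussian kernel, the helper names `gaussian_kde`, `tpe_split`, and `density_ratio`, and the grid of observations are all assumptions made for the example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1D Parzen (kernel density) estimator with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def tpe_split(observations, gamma):
    """Split (x, f) pairs into the better group D^(l) and the worse group D^(g)."""
    ranked = sorted(observations, key=lambda obs: obs[1])  # smaller f is better
    n_below = max(1, math.ceil(gamma * len(ranked)))
    return ranked[:n_below], ranked[n_below:]


# Hypothetical observations of f(x) = (x - 2)^2 on a coarse integer grid.
observations = [(float(x), (x - 2.0) ** 2) for x in range(-5, 6)]
below, above = tpe_split(observations, gamma=0.25)

# One KDE per group; the bandwidth is an arbitrary illustrative choice.
p_l = gaussian_kde([x for x, _ in below], bandwidth=1.0)
p_g = gaussian_kde([x for x, _ in above], bandwidth=1.0)


def density_ratio(x):
    """Acquisition value p(x | D^(l)) / p(x | D^(g))."""
    return p_l(x) / p_g(x)
```

With these observations, `density_ratio` is larger near the minimizer `x = 2` than far from it, so maximizing the ratio steers the next sample toward the better group.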
While the aforementioned TPE formulation solves single-objective HPO nicely, real-world HPO often involves black-box constraints such as memory budget or inference latency. Previously, Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)) proposed a constrained TPE using expected constrained improvement (ECI) (Gardner et al., [2014](https://arxiv.org/html/2606.09889#bib.bib4); Gelbart et al., [2014](https://arxiv.org/html/2606.09889#bib.bib5)), assuming *independence* between the objective and constraints:
$$
\mathrm{ECI}_{f^{\gamma}}[\boldsymbol{x}|\boldsymbol{c}^{\star},\mathcal{D}] = \mathrm{EI}_{f^{\gamma}}[\boldsymbol{x}|\mathcal{D}] \prod_{i=1}^{C} \mathbb{P}(c_i \leq c_i^{\star} | \boldsymbol{x}, \mathcal{D})
\tag{2}
$$

where $c_i(\boldsymbol{x}) \leq c_i^{\star}$ for $i \in \{1, \ldots, C\}$ are black-box constraints.
Optuna's TPE also supports constrained optimization, but its implementation has no formal documentation and differs from the method known as c-TPE in the literature. Optuna builds only one pair of KDEs, whereas Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)) build $C+1$ pairs of KDEs. Despite this implementation gap, we found that both approaches share the same theoretical foundation of expected constrained improvement (ECI). Accordingly, we narrow the definition of c-TPE given in Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)) to formulations that specifically use ECI as their acquisition function. Under this definition, we show that Optuna's constrained TPE, dubbed *joint c-TPE*, is a joint density generalization of the original c-TPE, dubbed *independent c-TPE*. Note that although independent c-TPE is also available via OptunaHub (Ozaki et al., [2026](https://arxiv.org/html/2606.09889#bib.bib8)), this algorithm is not included in the main package. We empirically demonstrate the key advantage of joint c-TPE, which avoids the relative importance dilution that independent formulations suffer under constraint correlation. Finally, we discuss the tradeoffs and unexplored open problems in joint c-TPE.
## 2 Findings: Optuna Constrained TPE Is a Joint Density Generalization of c-TPE
In this paper, we analyze the implementation present in Optuna v4.8. To formalize this approach, we rely on the following assumption regarding the marginal constraint density $p(\boldsymbol{c}|\mathcal{D}) = \int p(f,\boldsymbol{c}|\mathcal{D})\,df$ in the feasible region:
###### Assumption 1.
Given the feasible region $\Omega \coloneqq \prod_{i=1}^{C} (-\infty, c_i^{\star}]$, $\gamma \leq \int_{\boldsymbol{c} \in \Omega} p(\boldsymbol{c}|\mathcal{D})\,d\boldsymbol{c}$ holds.
Informally, this assumption requires that at least a fraction $\gamma$ of the observations be feasible ($\boldsymbol{c} \in \Omega$), ensuring that sufficient feasible observations exist to fill $\mathcal{D}^{(l)}$. Section [4](https://arxiv.org/html/2606.09889#S4) discusses limitations of this assumption.

The Optuna approach first splits the observations $\mathcal{D} = \{(\boldsymbol{x}_n, f_n, \boldsymbol{c}_n)\}_{n=1}^{N}$ into the feasible set $\mathcal{D}_{\mathrm{feas}} = \{(\boldsymbol{x}_n, f_n, \boldsymbol{c}_n) \in \mathcal{D} \mid \boldsymbol{c}_n \in \Omega\}$ and the infeasible set $\mathcal{D} \setminus \mathcal{D}_{\mathrm{feas}}$. Then it sorts the feasible observations by objective value $f_n$ and fills $\mathcal{D}^{(l)}$ with the top $\lceil \gamma N \rceil$ feasible observations. The next HP is selected by maximizing the density ratio $p(\boldsymbol{x}|\mathcal{D}^{(l)}) / p(\boldsymbol{x}|\mathcal{D}^{(g)})$, where $\mathcal{D}^{(g)} \coloneqq \mathcal{D} \setminus \mathcal{D}^{(l)}$. Importantly, the Optuna approach computes the likelihood of the objective and constraints *jointly*:
$$
p(\boldsymbol{x}|f,\boldsymbol{c},\mathcal{D}) =
\begin{cases}
p(\boldsymbol{x}|\mathcal{D}^{(l)}) & (f \leq f^{\gamma} \text{ and } \boldsymbol{c} \in \Omega) \\
p(\boldsymbol{x}|\mathcal{D}^{(g)}) & (\text{otherwise})
\end{cases}.
\tag{3}
$$

This formulation induces the following proposition:
###### Proposition 1.
Under the joint likelihood in Eq. ([3](https://arxiv.org/html/2606.09889#S2.E3)), $\mathrm{ECI}_{f^{\gamma}}[\boldsymbol{x}|\boldsymbol{c}^{\star},\mathcal{D}] \stackrel{\mathrm{rank}}{\simeq} p(\boldsymbol{x}|\mathcal{D}^{(l)}) / p(\boldsymbol{x}|\mathcal{D}^{(g)})$, where rank equivalence $f(\boldsymbol{x}) \stackrel{\mathrm{rank}}{\simeq} g(\boldsymbol{x})$ means the ordering is preserved: $f(\boldsymbol{x}) \leq f(\boldsymbol{x}') \Leftrightarrow g(\boldsymbol{x}) \leq g(\boldsymbol{x}')$.
###### Proof.
We apply Bayes' rule to the relation $\mathrm{ECI}_{f^{\gamma}}[\boldsymbol{x} \mid \boldsymbol{c}^{\star}, \mathcal{D}] \propto \mathbb{P}(f \leq f^{\gamma}, \boldsymbol{c} \in \Omega \mid \boldsymbol{x}, \mathcal{D})$ shown by Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)):

$$
\begin{aligned}
\mathrm{ECI}_{f^{\gamma}}[\boldsymbol{x} \mid \boldsymbol{c}^{\star}, \mathcal{D}]
&\propto \mathbb{P}(f \leq f^{\gamma}, \boldsymbol{c} \in \Omega \mid \boldsymbol{x}, \mathcal{D}) \\
&= \int_{(f,\boldsymbol{c}) \in \mathcal{F}^{\gamma} \times \Omega} \frac{p(\boldsymbol{x}|f,\boldsymbol{c},\mathcal{D})\, p(f,\boldsymbol{c}|\mathcal{D})}{p(\boldsymbol{x}|\mathcal{D})}\, df\, d\boldsymbol{c} \\
&= \frac{p(\boldsymbol{x}|\mathcal{D}^{(l)})}{p(\boldsymbol{x}|\mathcal{D})} \underbrace{\int_{(f,\boldsymbol{c}) \in \mathcal{F}^{\gamma} \times \Omega} p(f,\boldsymbol{c}|\mathcal{D})\, df\, d\boldsymbol{c}}_{=\gamma} \\
&\propto \left(\gamma + (1-\gamma) \frac{p(\boldsymbol{x}|\mathcal{D}^{(g)})}{p(\boldsymbol{x}|\mathcal{D}^{(l)})}\right)^{-1}
\stackrel{\mathrm{rank}}{\simeq} \frac{p(\boldsymbol{x}|\mathcal{D}^{(l)})}{p(\boldsymbol{x}|\mathcal{D}^{(g)})}
\end{aligned}
\tag{4}
$$

where $\mathcal{F}^{\gamma} \coloneqq (-\infty, f^{\gamma}]$ is the $\gamma$-quantile objective region on the feasible region, and the last line uses $p(\boldsymbol{x}|\mathcal{D}) = \gamma\, p(\boldsymbol{x}|\mathcal{D}^{(l)}) + (1-\gamma)\, p(\boldsymbol{x}|\mathcal{D}^{(g)})$, obtained by marginalizing the joint likelihood in Eq. ([3](https://arxiv.org/html/2606.09889#S2.E3)) over all $(f, \boldsymbol{c})$. ∎
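The final rank equivalence in the proof rests on a monotonicity argument that can be spelled out explicitly. The symbols $r$ and $h$ below are shorthand introduced only for this illustration; they do not appear in the original derivation.

```latex
\begin{aligned}
r(\boldsymbol{x}) &\coloneqq \frac{p(\boldsymbol{x}|\mathcal{D}^{(l)})}{p(\boldsymbol{x}|\mathcal{D}^{(g)})},
\qquad
h(r) \coloneqq \left(\gamma + \frac{1-\gamma}{r}\right)^{-1} = \frac{r}{\gamma r + (1-\gamma)}, \\
\frac{dh}{dr} &= \frac{1-\gamma}{\left(\gamma r + (1-\gamma)\right)^{2}} > 0
\quad \text{for } r > 0,\ \gamma \in (0, 1).
\end{aligned}
```

Since $h$ is strictly increasing in $r$, ordering candidates by $h(r(\boldsymbol{x}))$ gives the same ranking as ordering them by the density ratio $r(\boldsymbol{x})$ itself.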
This result has three notable properties. First, when all observations are feasible, the joint split shown in Eq. ([3](https://arxiv.org/html/2606.09889#S2.E3)) reduces to the standard TPE split shown in Eq. ([1](https://arxiv.org/html/2606.09889#S1.E1)). Second, the $\gamma$-quantile value $f^{\gamma}$ constrained on the feasible region is identical to c-TPE's split for the objective component. Third, and most importantly, the joint formulation uses one KDE pair whose split accounts for all constraints simultaneously, whereas independent c-TPE uses $C+1$ pairs and treats each constraint marginally. This means duplicating a constraint does not change the feasible/infeasible partition under joint c-TPE, making the acquisition function invariant to constraint duplication, unlike independent c-TPE. We empirically verify the last property in the next section.
## 3 Empirical Evaluations
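The third property can be checked directly on the split itself. The sketch below implements the feasible-first split described in this section under Assumption 1 and applies it to a handful of hypothetical observations, once with two constraints and once with the second constraint duplicated. The function names, the index-based return value, and the sample points are assumptions for illustration, not Optuna's code.

```python
import math


def is_feasible(constraints, thresholds):
    """True when every constraint value satisfies c_i <= c_i^*."""
    return all(c <= t for c, t in zip(constraints, thresholds))


def joint_split(observations, thresholds, gamma):
    """Return (below, above) index lists for (x, f, c) observations.

    Assumes enough feasible observations exist (Assumption 1), so no
    infeasible observation is needed to fill D^(l).
    """
    feasible = [
        i for i, (_, _, c) in enumerate(observations)
        if is_feasible(c, thresholds)
    ]
    feasible.sort(key=lambda i: observations[i][1])  # smaller f is better
    n_below = math.ceil(gamma * len(observations))
    below = feasible[:n_below]
    chosen = set(below)
    above = [i for i in range(len(observations)) if i not in chosen]
    return below, above


def objective(x, y):
    return (x - 2.0) ** 2 + (y - 2.0) ** 2


# Hypothetical sample points; the constraints are c1 = x <= 0 and c2 = y <= 0.
points = [(-1.0, -1.0), (-3.0, -2.0), (1.0, -1.0), (-0.5, -4.0),
          (2.0, 2.0), (-2.0, -0.5), (-4.0, 3.0), (-1.0, -3.0)]

two_constraints = [((x, y), objective(x, y), (x, y)) for x, y in points]
# Same points, but with c2 repeated four times (perfectly correlated copies).
duplicated = [((x, y), objective(x, y), (x, y, y, y, y)) for x, y in points]

split_plain = joint_split(two_constraints, (0.0, 0.0), gamma=0.25)
split_duplicated = joint_split(duplicated, (0.0,) * 5, gamma=0.25)
```

Both calls return the same index partition, so the single KDE pair built from it, and hence the density ratio, is unchanged by the duplicated constraint.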
We compare independent and joint c-TPE on a 2D toy problem, minimizing $f(x,y) = (x-2)^2 + (y-2)^2$ subject to two constraints $c_1(x,y) = x \leq 0$ and $c_2(x,y) = y \leq 0$ over $[-5,5]^2$. We consider two scenarios: (1) independent constraints $c_1, c_2$, and (2) $c_1$ and many duplicated copies of $c_2$ (perfectly correlated constraints). Comparing the scenario with many duplicated constraints (Figure [1](https://arxiv.org/html/2606.09889#S3.F1) (Bottom)) to the independent constraint scenario (Figure [1](https://arxiv.org/html/2606.09889#S3.F1) (Top)), independent c-TPE degrades because each duplicate adds a factor to the product of relative density ratios, accumulating constraint information and overweighting feasibility along the $y$-axis relative to the $x$-axis. Meanwhile, joint c-TPE is invariant to duplication since the feasible/infeasible partition remains unchanged. Note, however, that independent c-TPE outperformed joint c-TPE in the independent constraint scenario because the scenario aligns better with the independent formulation.
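The dilution effect can be illustrated with a small numeric sketch of a product-form acquisition. The per-constraint factors below are made-up positive scores standing in for the relative density ratios of two candidates; the names `independent_score`, `with_duplicates`, and `prefers_a` are hypothetical, and the numbers are not taken from the experiment.

```python
import math


def independent_score(objective_factor, constraint_factors):
    """Product-form acquisition: one multiplicative factor per constraint."""
    return objective_factor * math.prod(constraint_factors)


def with_duplicates(x_factor, y_factor, n_copies):
    """Constraint factors for c1 plus n_copies identical copies of c2."""
    return [x_factor] + [y_factor] * n_copies


# Candidate A looks better on the x-constraint, candidate B on the y-constraint.
candidate_a = {"x_factor": 0.9, "y_factor": 0.6}
candidate_b = {"x_factor": 0.5, "y_factor": 0.9}


def prefers_a(n_copies):
    """True when the product-form score ranks candidate A above candidate B."""
    score_a = independent_score(
        1.0, with_duplicates(**candidate_a, n_copies=n_copies)
    )
    score_b = independent_score(
        1.0, with_duplicates(**candidate_b, n_copies=n_copies)
    )
    return score_a > score_b
```

With a single copy of the y-constraint, candidate A ranks higher (0.54 versus 0.45). With three copies the y-factor is cubed and the ranking flips in favor of B, even though the underlying problem is identical. A score computed from one joint partition has no per-constraint factors to accumulate.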
Figure 1: Comparison of independent c-TPE (Left) and joint c-TPE (Right) on a 2D problem with 50 observations. The shaded area shows the feasible region, and the color gradation shows the objective value, which is better when the color is darker. Each dot represents an observation. The dots are colored based on the observation order; black means early observations, and white means later observations. *Top*: The case without duplicated constraints. Independent c-TPE explores at the boundary aggressively by exploiting the independence of the constraint pair. *Bottom*: The case with many duplicated constraints; its infeasible region shows much darker coloring. While joint c-TPE exhibits identical sampling behavior, independent c-TPE shows highly conservative exploration along the $x$-axis because the feasibility along the $y$-axis is overweighted.
## 4 Limitations & Conclusion
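For readers who want to try the joint formulation on this toy problem, the following sketch sets it up with Optuna's `TPESampler` and its `constraints_func` argument, which treats a trial as feasible when every returned value is at most zero. It assumes Optuna is installed and is not the authors' experiment script: the helper names, the `n_copies` parameter, the seed, and the trial budget are illustrative choices.

```python
def objective_value(x, y):
    """Toy objective f(x, y) = (x - 2)^2 + (y - 2)^2."""
    return (x - 2.0) ** 2 + (y - 2.0) ** 2


def constraint_values(x, y, n_copies=1):
    """c1 = x followed by n_copies copies of c2 = y; feasible when all <= 0."""
    return (x,) + (y,) * n_copies


def run_study(n_copies=1, n_trials=50, seed=0):
    """Run constrained TPE on the toy problem and return the study."""
    # Imported here so the helpers above remain usable without Optuna.
    import optuna

    def objective(trial):
        x = trial.suggest_float("x", -5.0, 5.0)
        y = trial.suggest_float("y", -5.0, 5.0)
        # Store the constraint values so the sampler can read them later.
        trial.set_user_attr("constraints", constraint_values(x, y, n_copies))
        return objective_value(x, y)

    def constraints_func(frozen_trial):
        return frozen_trial.user_attrs["constraints"]

    sampler = optuna.samplers.TPESampler(
        seed=seed, constraints_func=constraints_func
    )
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=n_trials)
    return study
```

Calling `run_study(n_copies=1)` and `run_study(n_copies=10)` gives one way to compare runs with and without duplicated constraints. Whether the two runs match exactly depends on implementation details that this sketch does not control.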
A primary methodological limitation of joint c-TPE comes from Assumption [1](https://arxiv.org/html/2606.09889#Thmassumption1). When Assumption [1](https://arxiv.org/html/2606.09889#Thmassumption1) is violated, Optuna fills empty slots in $\mathcal{D}^{(l)}$ from infeasible observations sorted by total constraint violation (Deb et al., [2002](https://arxiv.org/html/2606.09889#bib.bib3)). However, the effectiveness and theoretical validity of this fallback strategy remain largely unexplored, unlike independent c-TPE, which has been carefully engineered and analyzed by Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)). Open questions include: (1) which tie-breaking strategies are viable, (2) which performs best, and (3) when joint c-TPE outperforms independent c-TPE (and when not). A key structural difference is that independent c-TPE requires only one feasible solution per constraint individually, whereas joint c-TPE requires feasibility across *all* constraints simultaneously. This stricter feasibility requirement can be problematic in independently or severely constrained settings. Additionally, joint c-TPE cannot structurally accommodate partial observations (see Appendix C of Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10))), which may slow convergence when some constraints are cheaper to evaluate than the objective.
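The fallback described above can be sketched as follows. The text only states that infeasible observations are sorted by total constraint violation; defining that quantity as the sum of the positive parts of $c_i - c_i^{\star}$ is an assumption made here, as are the function names and the sample data. Optuna's actual code may differ in its details.

```python
import math


def total_violation(constraints, thresholds):
    """Assumed definition: sum of positive parts of c_i - c_i^*."""
    return sum(max(0.0, c - t) for c, t in zip(constraints, thresholds))


def below_with_fallback(observations, thresholds, gamma):
    """Return indices forming D^(l) for (x, f, c) observations.

    Feasible observations come first, ordered by objective value; any
    remaining slots are filled by the least-violating infeasible ones.
    """
    n_below = math.ceil(gamma * len(observations))
    feasible, infeasible = [], []
    for i, (_, _, c) in enumerate(observations):
        if total_violation(c, thresholds) == 0.0:
            feasible.append(i)
        else:
            infeasible.append(i)
    feasible.sort(key=lambda i: observations[i][1])
    infeasible.sort(
        key=lambda i: total_violation(observations[i][2], thresholds)
    )
    return (feasible + infeasible)[:n_below]


# Hypothetical data with a single feasible point, so Assumption 1 fails
# for gamma = 0.5 and two slots must be filled from infeasible points.
points = [(-1.0, -1.0), (1.0, -1.0), (0.5, 0.2),
          (3.0, 3.0), (2.0, -1.0), (0.1, 0.1)]
observations = [
    ((x, y), (x - 2.0) ** 2 + (y - 2.0) ** 2, (x, y)) for x, y in points
]
below = below_with_fallback(observations, (0.0, 0.0), gamma=0.5)
```

Here `below` holds the one feasible observation followed by the two infeasible observations closest to feasibility, which is the behavior whose theoretical justification the paragraph above flags as open.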
On the other hand, joint c-TPE offers complementary advantages. It is structurally simpler and naturally invariant to constraint duplication, a property independent c-TPE lacks. Since joint c-TPE maintains only one KDE pair rather than $C+1$ pairs, it aligns naturally with multi-objective TPE formulations (Ozaki et al., [2020](https://arxiv.org/html/2606.09889#bib.bib7), [2022](https://arxiv.org/html/2606.09889#bib.bib6)), facilitating future extensions. Determining when each formulation is preferable remains an open question, requiring systematic empirical comparison across diverse problem structures as future work.
### Citation Guidance
When using Optuna's constrained TPE, we recommend citing both this paper and Watanabe and Hutter ([2023](https://arxiv.org/html/2606.09889#bib.bib10)), as the latter provides a more detailed treatment of the constrained TPE mechanism and problem setup.
## References
- Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
- Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems.
- Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2).
- Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In International Conference on Machine Learning.
- Gelbart, M. A., Snoek, J., and Adams, R. P. (2014). Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence.
- Ozaki, Y., Tanigaki, Y., Watanabe, S., Nomura, M., and Onishi, M. (2022). Multiobjective tree-structured Parzen estimator. Journal of Artificial Intelligence Research, 73.
- Ozaki, Y., Tanigaki, Y., Watanabe, S., and Onishi, M. (2020). Multiobjective tree-structured Parzen estimator for computationally expensive optimization problems. In Genetic and Evolutionary Computation Conference.
- Ozaki, Y., Watanabe, S., and Yanase, T. (2026). OptunaHub: A platform for black-box optimization. Journal of Machine Learning Research.
- Watanabe, S. (2023). Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv preprint arXiv:2304.11127.
- Watanabe, S. and Hutter, F. (2023). c-TPE: Tree-structured Parzen estimator with inequality constraints for expensive hyperparameter optimization. In International Joint Conference on Artificial Intelligence.