When Do Data-Driven Systems Exhibit the Capability to Infer?

arXiv cs.AI 06/11/26, 04:00 AM Papers
ai-act regulation inference credit-scoring statistical-learning european-union
Summary
This paper develops a framework to grade the capability to infer in data-driven systems under the European AI Act, using credit scoring as a case study to illustrate where inference occurs and where regulatory clarity is needed.
arXiv:2606.11769v1 Announce Type: new
# When Do Data-Driven Systems Exhibit the Capability to Infer?
Source: [https://arxiv.org/html/2606.11769](https://arxiv.org/html/2606.11769)
Maximilian Poretschkin, Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), University of Bonn, Lamarr Institute for Machine Learning and Artificial Intelligence, Sankt Augustin, Bonn, Germany, maximilian.poretschkin@iais.fraunhofer.de

Tabea Naeven, Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Sankt Augustin, Germany, tabea.naeven@iais.fraunhofer.de

###### Abstract

The European AI Act is the first comprehensive regulation of artificial intelligence \(AI\), setting out extensive obligations, particularly for so\-called high\-risk and general\-purpose AI systems\. A key distinguishing feature of AI systems under the AI Act is the capability to infer\. Since the AI Act does not clearly define what inference is, there is a gray area for certain data\-driven systems\. A specific example is credit scoring systems, which are listed by Annex III of the AI Act\. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all\.

Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at [https://github.com/fraunhofer-iais/inference-framework-credit-scorecards](https://github.com/fraunhofer-iais/inference-framework-credit-scorecards).

AI regulation, AI Act, AI definition, Inference, Credit Scoring

CCS Concepts: Social and professional topics → Governmental regulations

| Monthly Income (value) | Points |
| --- | --- |
| missing value | 2 |
| [0, 3000] | -4 |
| (3000, 5000] | -3 |
| (5000, 7000] | 0 |
| (7000, 10000] | 3 |
| ≥ 10000 | 5 |

| Number of times 30-59 days past due (value) | Points |
| --- | --- |
| 0 | 8 |
| 1 | -13 |
| ≥ 2 | -29 |

Scorecards like this can have a huge impact on customers. Are they a result of artificial intelligence as defined by the European AI Act?

## 1. Introduction

The European AI Act is the first comprehensive legal framework for artificial intelligence (AI) to come into force, subjecting high-risk systems in particular to strict regulatory requirements (European Commission, [2021](https://arxiv.org/html/2606.11769#bib.bib16)). The question of what constitutes an AI system is therefore central to the scope of this legal framework.

The AI Act opts for a technology-neutral definition that closely follows international preparatory work, particularly that of the OECD (OECD, [2024](https://arxiv.org/html/2606.11769#bib.bib18)), and formulates the capability to infer as the key distinguishing feature of AI systems. The inference of the AI system refers to the capability “to derive from received inputs for explicit or implicit goals how outputs such as predictions, recommendations, or decisions are generated.” However, despite its regulatory importance, the concept of inference remains under-specified, particularly with respect to systems that rely on classical statistical models rather than contemporary machine learning architectures. Important examples of such statistical models are linear and logistic regression models.

In practice, the discrepancy is particularly evident in credit scoring workflows: Annex III of the AI Act lists credit scoring. This implies that credit scoring systems are classified as high-risk systems by the AI Act, provided that the criteria outlined by Article 6 are met. However, their implementation is often based on (partially) automated binning procedures and logistic regression models. During the European Commission’s consultation process on the definition of AI (European Commission, [2024](https://arxiv.org/html/2606.11769#bib.bib9)), industry representatives repeatedly expressed doubts as to whether logistic regression models should actually be considered AI systems (Association of Consumer Credit Information Suppliers, [2024](https://arxiv.org/html/2606.11769#bib.bib10)). Although the European Commission published (legally non-binding) guidelines at the beginning of 2025 that further clarify the definition of AI (European Commission, [2025](https://arxiv.org/html/2606.11769#bib.bib17)), there is still uncertainty among industry practitioners and regulators as to whether logistic regression is considered AI within the meaning of the AI Act (Singh et al., [2025](https://arxiv.org/html/2606.11769#bib.bib28)).

This ambiguity is also evident in the history of the AI Act: the first draft of the AI Act (European Commission, [2021](https://arxiv.org/html/2606.11769#bib.bib16)) added an Annex to the AI definition, listing specific methods that were considered AI. In addition to machine learning, general statistical procedures and optimization methods were also mentioned. This approach was primarily criticized because it would have also covered many conventional software programs. In a compromise draft by the Council of the European Union (Council of the European Union, [2024](https://arxiv.org/html/2606.11769#bib.bib20)), this Annex was removed and replaced by two recitals referring to machine learning and logic- and knowledge-based approaches, listing logistic regression as a machine learning technique.

The final version of the AI Act uses the technology\-neutral definition as described above\.

This work addresses this regulatory and conceptual ambiguity in two steps. First, we develop a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. Second, we illustrate the framework by creating two realistic credit scoring workflows and analyze where and to what extent inference occurs within them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered, and it shows that the involvement of human experts during development can have significant influence on the capability to infer.

## 2. Related Work

#### AI definition of the AI Act:

The definition has primarily been discussed from the perspective of its genesis and the problem of formally defining the subject matter of regulation in legal terms. (Schuett, [2023](https://arxiv.org/html/2606.11769#bib.bib30)) argues that AI regulation in general should not rely on a definition of AI, as most existing AI definitions do not meet the requirements for legal definitions. (Ebers et al., [2021](https://arxiv.org/html/2606.11769#bib.bib29)) and (Veale and Zuiderveen Borgesius, [2021](https://arxiv.org/html/2606.11769#bib.bib22)) analyze the first proposal of the European Commission for the AI Act, pointing out that the definition therein is too broad. (Finocchiaro, [2024](https://arxiv.org/html/2606.11769#bib.bib46)) considers that having a list of AI techniques included in the AI Act may risk excluding future technological developments, and (Ellul, [2022](https://arxiv.org/html/2606.11769#bib.bib47)) questions the feasibility of updating the list at a sufficient pace. (Castán, [2024](https://arxiv.org/html/2606.11769#bib.bib21)), (Presno Linera and Meuwese, [2025](https://arxiv.org/html/2606.11769#bib.bib40)) and (Fernández-Llorca et al., [2025](https://arxiv.org/html/2606.11769#bib.bib12)) describe the process by which the definition was developed during the negotiations of the AI Act. The latter also analyzes terms such as AI system and generative AI from an interdisciplinary perspective. (Floridi, [2023](https://arxiv.org/html/2606.11769#bib.bib3)) traces the development of the AI Act’s definition of AI and examines the extent to which it is compatible with that of the American Executive Order. (Hacker, [2024](https://arxiv.org/html/2606.11769#bib.bib11)) comments on the final trilogue version of the AI Act, stating that inference is the only characteristic that distinguishes AI systems from those built on classical software.

#### Credit scoring under the AI Act:

The regulation of credit scoring through the AI Act has been investigated through different lenses: (Spindler, [2023](https://arxiv.org/html/2606.11769#bib.bib32)) and (Montagnani et al., [2024](https://arxiv.org/html/2606.11769#bib.bib33)) analyze the impact of the (first draft version of the) AI Act regarding credit scoring, considering also existing banking regulation and its overlaps with the proposed AI Act. (Hacker and Eber, [2025](https://arxiv.org/html/2606.11769#bib.bib24)) examine the regulation of credit scoring in the context of underwriting and the regulatory landscape for insurance. Other work examines the legal requirements of individual trustworthiness dimensions for high-risk AI systems: (Pavlidis, [2024](https://arxiv.org/html/2606.11769#bib.bib34)) investigates the AI Act’s requirements for explainability. (Buttaboni and Floridi, [2026](https://arxiv.org/html/2606.11769#bib.bib41)) propose a regulatory taxonomy that distinguishes transparency, traceability, interpretability, and explainability as layered and interdependent dimensions of AI opacity and exemplify these for credit scoring.

#### Different stages of inference:

In (Breiman, [2001](https://arxiv.org/html/2606.11769#bib.bib27)), Breiman highlights a qualitative shift from parametric estimation to data-driven construction of decision logic. Decision tree learning explicitly constructs input-output mappings from data (Quinlan, [1986](https://arxiv.org/html/2606.11769#bib.bib1); Breiman et al., [1984](https://arxiv.org/html/2606.11769#bib.bib39)), while instance-based and kernel methods realize inference through data-induced similarity geometries rather than explicit rules (Cover and Hart, [1967](https://arxiv.org/html/2606.11769#bib.bib35); Boser et al., [1992](https://arxiv.org/html/2606.11769#bib.bib36)). Representation learning extends this dependence to the feature space itself, learning representations jointly with decision functions (Bengio et al., [2013](https://arxiv.org/html/2606.11769#bib.bib38); Goodfellow et al., [2013](https://arxiv.org/html/2606.11769#bib.bib37)).

While there is substantial work on the AI Act, on credit scoring, and on different inference properties in the context of statistical learning, to the best of our knowledge no existing work analyzes the inference concept of the AI Act in greater technical detail.

## 3. Analysis of the term inference

In general, inference refers to “the drawing of a conclusion from known or assumed facts or statements” (Oxford English Dictionary, [2025](https://arxiv.org/html/2606.11769#bib.bib42)). Statistical inference describes the process of drawing conclusions about a population based on data drawn from a sample of that population¹ (Casella and Berger, [2002](https://arxiv.org/html/2606.11769#bib.bib43)). In machine learning, inference refers to the application of a trained model to a new data point. Since the AI Act is a legal text, the term inference must be interpreted strictly in its meaning. At the same time, the legal interpretation of this term needs to be operationalized by computer science.

¹ The term population here refers to a set of similar items or events which is of interest for some question to be investigated.

Article 3 of the AI Act defines an AI system as “a machine-based system that is designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.” As emphasized by Hacker (Hacker, [2024](https://arxiv.org/html/2606.11769#bib.bib11)), the capability to infer is the decisive distinguishing feature, since the remaining elements of the definition may also be satisfied by conventional software.

Recital 12 clarifies that the “capability to infer refers to the process of obtaining the outputs, such as predictions, content, recommendations, or decisions, which can influence physical and virtual environments, and to a capability of AI systems to derive models or algorithms, or both, from inputs or data”. It emphasizes that the techniques that enable inference include machine learning approaches that learn from data how to achieve certain objectives, while systems that rely exclusively on human-made rules or perform only simple data processing are excluded. The Commission’s legally non-binding Guidelines on the definition of an AI system further specify that this derivation primarily concerns the development phase of the system, without excluding the operational phase (European Commission, [2025](https://arxiv.org/html/2606.11769#bib.bib17)).

Taken together, Article 3 and Recital 12 establish inference as a structural criterion rather than a purely functional one\. A system exhibits the capability to infer when the form of the input\-output mapping that generates predictions, content, recommendations, or decisions is at least partially shaped by data, rather than being fully specified ex ante by human developers\. However, the AI Act does not specify how strong this data\-driven determination must be, leaving open where exactly the threshold for the capability to infer is to be drawn\.

To systematize this issue and relate it to the threshold implied by the AI Act, we develop a framework² that distinguishes different levels of data involvement in shaping input-output mappings. As a conceptual starting point, we draw on the formal abstraction of learning problems in statistical learning theory (Mitchell, [1997](https://arxiv.org/html/2606.11769#bib.bib44)).

² Our framework applies to learning systems, i.e. it does not address the “logic- and knowledge-based approaches” mentioned in Recital 12 of the AI Act.

In supervised learning, a learning problem is characterized by an input space $X \subseteq \mathbb{R}^{n}$, an output space $Y$, and a generally not directly observable mechanism that assigns outputs to inputs. Inputs $x \in X$ are assumed to be drawn independently from an underlying probability distribution over $X$, and each observed output $y \in Y$ reflects the response of this unobservable mechanism to the corresponding input. The components of an input vector $x \in X$ are individual measurable input variables called features. A learning system observes a finite data set $D = \{(x_i, y_i)\}_{i=0}^{m}$ and aims to construct a function $h: X \rightarrow Y$, i.e., a mapping from inputs to outputs, that generalizes beyond the observed sample $D$. The space of all such functions the learning system is allowed to choose from is called the hypothesis space.³ To assess the quality of such a function $h$, a loss function $l: Y \times Y \rightarrow \mathbb{R}_{+}$ is specified, which measures the discrepancy between predicted outputs and observed responses. Since the precise form of the data-generating process is not observable, learning algorithms rely on empirical criteria computed from the observed data to guide the construction of $h$. A fully specified function of this kind (including its structure and parameter values) is referred to as a model. The process of obtaining such a model from a given data set $D$ is called training.

³ The so-called VC dimension (Vapnik and Chervonenkis, [1971](https://arxiv.org/html/2606.11769#bib.bib45)) provides a measure of how “large” this hypothesis space is. The “larger” the hypothesis space, the more learning options the associated system has; in other words, the greater its capability to infer. However, since calculating the VC dimension can be quite complicated, we have decided not to include it in the framework developed below for reasons of practical applicability.
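The “empirical criteria” mentioned in the paragraph above are, in the textbook setting, typically instances of the empirical risk, i.e. the average loss over the observed sample. The formulation below is the standard one rather than something stated in the paper; the symbols $\hat{R}_D$, $\hat{h}$, and $\mathcal{H}$ (for the hypothesis space) are notation introduced here purely for illustration:

$$\hat{R}_D(h) = \frac{1}{m+1} \sum_{i=0}^{m} l\big(h(x_i), y_i\big), \qquad \hat{h} \in \arg\min_{h \in \mathcal{H}} \hat{R}_D(h).$$

The factor $1/(m+1)$ reflects that $D$ is indexed from $i = 0$ to $m$ and therefore contains $m+1$ observations.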

This abstraction generalizes beyond supervised learning. In unsupervised learning, only inputs $x_i \in X$ are observed, and learning aims at uncovering structural regularities in the distribution over $X$, such as clusters or latent representations, i.e., relations of similarity between inputs. In reinforcement learning, learning proceeds through sequential interaction with an environment, and decision functions are optimized with respect to cumulative reward signals rather than fixed target outputs. Despite these differences, all three paradigms share a common structural perspective: learning involves identifying admissible input-output mappings on the basis of data.

Crucially, data can affect admissible input\-output mappings in qualitatively different ways\. They may merely fix numerical parameters within a fully specified functional form, restrict the set of admissible structures through selection or regularization, or actively construct a novel logic according to which output is generated\.

In many cases, this influence is mediated by transformations of the input space. Feature selection (Guyon and Elisseeff, [2003](https://arxiv.org/html/2606.11769#bib.bib31)) restricts admissible inputs, feature generation applies predefined transformations, and feature learning constructs representations, i.e., encodings of the original features in another space (Bengio et al., [2013](https://arxiv.org/html/2606.11769#bib.bib38); Goodfellow et al., [2013](https://arxiv.org/html/2606.11769#bib.bib37)), that are themselves data-dependent, thereby reshaping the space in which functions and output-generating structures are formed.

### 3.1. A Staged Framework of Inference

Building on this structural perspective, we propose a staged framework that distinguishes inference mechanisms according to how data influence the formation of input-output mappings. The framework is inspired by Breiman’s distinction (Breiman, [2001](https://arxiv.org/html/2606.11769#bib.bib27)) between the data modeling culture, in which the functional form of the input-output mapping is specified ex ante and data serve primarily to estimate parameters, and the algorithmic modeling culture, in which the logic of the input-output mapping itself is derived from data.

We refine this distinction by defining a sequence of inference mechanisms, each strictly extending the previous one: each level introduces an additional way in which data can shape input-output mappings. The five levels are summarized in Table [1](https://arxiv.org/html/2606.11769#S3.T1).

#### 3.1.1. Level 0: Fixed mapping

At the lowest level, systems implement a fully specified mapping from inputs to outputs, defined entirely ex ante by human designers\. There are no alternative input\-output mappings available to the system\. Data may be processed or filtered, but they do not influence parameters or structure\. Such systems correspond to rule\-based execution or basic data processing as described in Recital 12 of the AI Act\.

#### 3.1.2. Level 1: Parametric adaptation within a fixed structure

At the first non-trivial level, the functional form of the input-output mapping is fixed in advance, but contains free numerical parameters. Data are used to determine these parameter values, while the structure of the input-output mapping remains unchanged. Classical linear and logistic regression exemplify this setting. For instance, input-output mappings may be restricted to quadratic polynomials of the form $h(x) = ax^{2} + bx + c$ with $a, b, c \in \mathbb{R}$, where learning consists solely in estimating the coefficients. Data thus determine a point within a predefined family of input-output mappings, but do not alter the functional form of the mapping itself.
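The quadratic example above can be sketched in a few lines. The following is a minimal illustration, assuming an ordinary least-squares criterion (the paper does not prescribe one); the function name `fit_quadratic` and the sample data are hypothetical. Whatever data are supplied, the result is always three coefficients of the same functional form.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of h(x) = a*x^2 + b*x + c via the normal equations."""
    # Each design-matrix row is (x^2, x, 1); accumulate A^T A and A^T y.
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for x, y in zip(xs, ys):
        row = (x * x, x, 1.0)
        for i in range(3):
            aty[i] += row[i] * y
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Hypothetical sample generated from h(x) = 2x^2 - 3x + 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2 * x * x - 3 * x + 1 for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

The data only move the point $(a, b, c)$ inside the predefined family; no input could make this procedure return, say, a cubic or a piecewise rule.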

#### 3.1.3. Level 2: Structural selection

At Level 2, data are used to select among a finite or discretely defined set of admissible input\-output mappings\. The available alternatives are specified ex ante, but data determine which one is employed\.

A canonical example is stepwise feature selection\. In its forward variant, the learning procedure starts from a minimal model containing no explanatory variables and successively adds features from a predefined pool\.
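A forward stepwise procedure of this kind could look roughly as follows. This is a sketch under stated assumptions: it uses ordinary least squares with the residual sum of squares as the selection criterion and an absolute improvement threshold `min_gain` as the stopping rule, neither of which is specified in the paper. All function names and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination.

    Assumes a is non-singular (no perfectly collinear features).
    """
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(rows, y, features):
    """Residual sum of squares of an OLS fit on an intercept plus `features`."""
    design = [[1.0] + [row[j] for j in features] for row in rows]
    k = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    aty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(ata, aty)
    return sum((t - sum(w * v for w, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def forward_selection(rows, y, min_gain=1e-6):
    """Start with no features; greedily add the one that lowers the RSS most."""
    selected, remaining = [], list(range(len(rows[0])))
    best = rss(rows, y, selected)
    while remaining:
        scores = {j: rss(rows, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < min_gain:
            break  # no candidate improves the fit enough
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected


# Hypothetical pool of three features; the target depends on features 0 and 2 only.
rows = [[1, 5, 2], [2, 3, 0], [3, 8, 1], [4, 1, 5],
        [5, 9, 3], [6, 2, 4], [7, 7, 7], [8, 4, 6]]
y = [3 * r[0] - 2 * r[2] + 1 for r in rows]
selected = forward_selection(rows, y)
print(selected)
```

The candidate models (all subsets of the predefined pool) exist before any data are seen; the data only decide which of them is used, which is what distinguishes selection from construction.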

$L_{1}$-regularization also falls into this category. While implemented via continuous optimization, the constraint induces sharp exclusions of entire parameter combinations, which effectively removes structural alternatives from consideration.
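The “sharp exclusions” can be made visible with a small lasso sketch. It assumes a squared-error loss with an $L_1$ penalty, minimised by cyclic coordinate descent with soft-thresholding; this is one common way to solve the problem, not a method taken from the paper. The penalty weight `lam`, the function names, and the data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values with |rho| <= lam become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(rows, y, lam, n_sweeps=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(rows), len(rows[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, target in zip(rows, y):
                partial = target - sum(beta[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * partial
            rho /= n
            z = sum(row[j] ** 2 for row in rows) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Hypothetical data: a strong effect of feature 0 and a weak effect of feature 1.
rows = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * r[0] + 0.1 * r[1] for r in rows]
beta = lasso_coordinate_descent(rows, y, lam=0.5)
print(beta)
```

Although the optimization is continuous, the weak coefficient is set to exactly zero rather than to a small value, so the fitted model no longer contains that feature at all.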

#### 3.1.4. Level 3: Data-driven structural construction

Beyond selection lies structural construction: input-output mappings are not chosen from a predefined catalog but are functionally constructed from data. The concrete form of the input-output mapping is not specified ex ante. This can happen explicitly (a symbolic input-output mapping is learned) or implicitly (the input-output mapping cannot be made explicit).

#### Level 3a: Explicit structural construction

Decision trees and rule learning are canonical examples. While learning algorithms impose general constraints (e.g. admissible split operations or maximum depth), the concrete structure (splits, rules, and their arrangement) is built in response to the observed data. Data thus determine not only which structure is used, but how it is formed. This represents a stronger form of inference, as data induce novel structural configurations rather than merely activating predefined ones.

#### Level 3b: Implicit structural construction

Structural construction need not yield an explicit symbolic model\. Instance\-based methods such as k\-nearest neighbors or kernel methods rely on data\-induced relational structures, such as neighborhood relations or similarity geometries\. Here, the input\-output mapping is realized through the geometry of the data rather than through explicitly represented rules\.
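A k-nearest-neighbors classifier makes this concrete. The sketch below assumes Euclidean distance and a simple majority vote; the function name, labels, and points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster training set.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
label_low = knn_predict(train, (1.5, 1.5))
label_high = knn_predict(train, (6.5, 6.5))
print(label_low, label_high)
```

There is no fitted formula or rule set to inspect: the stored data, together with the distance function, are the model, and the decision boundary exists only implicitly in their geometry.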

#### 3.1.5. Level 4: Representational construction

The strongest form of inference arises when data determine not only the input\-output mapping but also the representational space in which the outputs are formed\. In representation learning, features are no longer fixed inputs but learned objects\. Deep neural networks and large language models exemplify this regime\. Here, admissible input\-output mappings are themselves reshaped by data through the learning of latent representations\.

This framework does not by itself determine at which level the capability to infer is reached within the meaning of the AI Act. Rather, it provides a structured basis for interpreting this threshold and for ensuring that similar borderline cases are treated in a similar manner. Indications for where the line may be drawn can be derived from the (legally non-binding) Commission Guidelines on the definition of an artificial intelligence system (European Commission, [2025](https://arxiv.org/html/2606.11769#bib.bib17)). However, as their explanations are not fully consistent, their interpretation is not straightforward. Article 42 of these guidelines states that “systems used to improve mathematical optimization or to accelerate and approximate traditional, well established optimization methods, such as linear or logistic regression methods, fall outside the scope of the AI system definition. This is because, while those models have the capacity to infer, they do not transcend ‘basic data processing’.” The confusion arises for two reasons. First, Recital 12 of the AI Act states that “the capacity of an AI system to infer *transcends* basic data processing”, i.e., the capacity to infer *is more* than basic data processing, so a model cannot have the capacity to infer and at the same time remain within basic data processing. Second, one can view the construction of regression models, just like that of any other machine learning model, as the application of a mathematical optimization procedure, namely the minimization of the loss function. However, improving, accelerating, or approximating optimization methods generally plays no role in regression models (nor in machine learning models), which causes additional confusion.

If Article 42 is interpreted to mean that systems based on regression models do not constitute AI in the sense of the AI Act, the threshold for the capability to infer must be set as follows: systems at Level 1 may fall outside the scope of the AI system definition, as their structure is fully specified ex ante and data merely determine parameters. By contrast, systems at Level 3 and above closely align with the notion of “deriving models or algorithms from data” in Recital 12, as the input-output mapping is functionally constructed from data. Systems at Level 2 occupy an intermediate position, since they involve data-driven selection among predefined structures without fully constructing them.

At the same time, the ambiguity of the Guidelines leaves room for an alternative interpretation: one could argue that even parameter estimation within a fixed functional form already constitutes a form of inference, as the model is fitted to data and thereby “derived” from it\. On this reading, the capability to infer could already be present at Level 1, shifting the threshold of the AI system definition accordingly\.

As the Commission Guidelines (European Commission, [2025](https://arxiv.org/html/2606.11769#bib.bib17)) emphasize that “the notions of autonomy and inference go hand in hand”, we finally analyze the relationship between autonomy and the capability to infer. Our discussion shows that the key distinguishing feature of AI systems is the capability to infer and that autonomy plays only a secondary role. In classical AI literature, autonomy is often understood in epistemic terms. For instance, (Russell and Norvig, [2021](https://arxiv.org/html/2606.11769#bib.bib51)) define autonomy as the extent to which an agent⁴ relies on its own percepts and learning processes rather than on prior knowledge provided by its designer. On this understanding, autonomy is closely linked to the system’s ability to learn from and adapt to data, and thus conceptually closely aligned with what the AI Act describes as the capability to infer. However, the concept of autonomy introduced by the AI Act is more reminiscent of automation: Article 3 of the AI Act requires that a system is “designed to operate with varying levels of autonomy”, which Recital 12 defines as some degree of independence from human involvement and the capabilities to operate without human intervention. This particular requirement would also be fulfilled by systems that are based on “classical” software (i.e. systems that are not considered to be AI according to Recital 12).
The Guidelines refine the concept somewhat: They further state that a “system that requires manually provided inputs to generate an output by itself is a system with ‘some degree of independence of action’, because the system is designed with the capability to generate an output without this output being manually controlled, or explicitly and exactly specified by a human.” If one interprets the last half-sentence to mean that systems based on rules defined solely by natural persons do not have autonomy, one is again close to the computer science concept of autonomy, which in turn is closely linked to the concept of the capability to infer in the AI Act. Otherwise, if the concept of autonomy is interpreted in the sense of automation, it cannot serve as a distinguishing feature of AI systems.

⁴ Note that this term refers to the notion of “classical” AI agents in the computer science literature; however, it also holds true for “modern”, LLM-based AI agents.

| Level | Inference mechanism | Role of data in shaping the input-output mapping | Examples |
| --- | --- | --- | --- |
| 0 | Fixed mapping | No restriction or adaptation of structure by data | Hard-coded rules, deterministic decision logic, fixed descriptive statistics (means, counts) |
| 1 | Parametric adaptation | Data determine parameter values within a fixed structure | Linear regression, logistic regression, $k$-means with fixed $k$, principal component analysis with fixed dimensionality |
| 2 | Structural selection | Data select among predefined alternatives with fixed structure | Stepwise feature selection, L1-regularized regression, data-driven selection of number of clusters $k$ in $k$-means |
| 3a | Structural construction (explicit) | Data construct symbolic or combinatorial structure | Decision trees, rule learning, hierarchical clustering, tree-based discretization |
| 3b | Structural construction (implicit) | Data induce relational or geometric structure | $k$-nearest neighbors, kernel support vector machines, kernel principal component analysis |
| 4 | Representational construction | Data induce the representational space itself | Deep neural networks, transformers, autoencoders, self-supervised representation learning |

Table 1. Staged framework of inference mechanisms that derive input-output mappings from data

### 3.2. Categorization of typical data-driven models used for credit scoring

To prepare our analysis of the credit scoring use case, we investigate binning and logistic regression with regard to our framework. Binning is commonly used as a data preprocessing technique rather than treated as a machine learning model in its own right. Within the meaning of the AI Act, binning is not considered AI as it does not produce predictions, decisions, recommendations, or outputs. However, it can contribute to the overall capability of an AI system to infer.

#### 3\.2\.1\.Binning

Binning transforms a numerical input variable into a categorical representation by partitioning its domain into intervals. In manual binning, interval boundaries are defined ex ante based on expert judgment or regulatory conventions. Similarly, quantile-based binning fixes the number of bins $K$ and defines intervals via empirical quantiles: $B_k = [Q_{k-1}, Q_k)$, where $Q_k$ denotes splitting points chosen so that each bin $B_k$ comprises the same number of observations. Although quantile binning uses data to determine numerical cut points, the structural form of the partition (equal-frequency bins with fixed cardinality) is fully specified in advance and independent of the target variable. Thus it remains below structural construction in the framework.
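As an illustrative sketch (not taken from the paper's repository), equal-frequency binning with a fixed number of bins $K$ can be written with the Python standard library alone. The helper names `quantile_cut_points` and `assign_bin` are hypothetical, and `statistics.quantiles` is used with its default interpolation method, which is only one of several quantile conventions. Note that the target variable never appears in the code.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def quantile_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points Q_1, ..., Q_{K-1}."""
    return quantiles(values, n=n_bins)


def assign_bin(x, cut_points):
    """Map x to the 0-based index of its half-open bin [lower cut, upper cut)."""
    return bisect_right(cut_points, x)


values = list(range(1, 101))            # toy data: 1, 2, ..., 100
cuts = quantile_cut_points(values, 4)   # K = 4 is fixed by the modeler, not by the data
counts = Counter(assign_bin(v, cuts) for v in values)
print(cuts)
print(sorted(counts.items()))
```

Only the numerical positions of the cut points depend on the data. The number of bins and the equal-frequency rule are set in advance, which is why the text places this technique below structural construction.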

By contrast, data-driven binning methods commonly used in credit scoring rely on recursive binary splits that optimize a target-dependent impurity criterion, most prominently the Shannon entropy. Let $\mathcal{X}$ be a set of instances with target variable $y$ taking values in $Y$. Then the Shannon entropy of the target variable in $\mathcal{X}$ is defined as

$$H(\mathcal{X}) = -\sum_{y \in Y} p(y) \log p(y), \tag{1}$$

where $p(y)$ is the proportion of class $y \in Y$ in $\mathcal{X}$ (Hastie et al., [2009](https://arxiv.org/html/2606.11769#bib.bib7)). For a candidate split $s$ that partitions the set $\mathcal{X}$ into regions $R_1$ and $R_2$, the information gain $IG(s)$ is defined as

$$IG(s) = H(\mathcal{X}) - \sum_{k=1}^{2} \frac{|R_k|}{|\mathcal{X}|} H(R_k). \tag{2}$$

The algorithm recursively selects the split that maximizes $IG(s)$ subject to stopping criteria such as a maximum number of bins or a minimum relative reduction in entropy.
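The two definitions above can be made concrete with a short sketch. It assumes the natural logarithm (the text does not fix a base), takes midpoints between distinct observed values as candidate thresholds, and uses hypothetical function names. It performs a single split search only, not the full recursive procedure, and expects at least two distinct input values.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a list of class labels, written as sum of p * log(1/p)."""
    n = len(labels)
    return sum((c / n) * log(n / c) for c in Counter(labels).values())


def information_gain(xs, ys, threshold):
    """IG of the split into R1 = {x <= threshold} and R2 = {x > threshold}."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
    return entropy(ys) - weighted


def best_split(xs, ys):
    """Return the candidate threshold with maximal information gain."""
    points = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(candidates, key=lambda t: information_gain(xs, ys, t))


# Toy data (hypothetical): the target flips between x = 1 and x = 2
xs = [0, 0, 1, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
t = best_split(xs, ys)
gain = information_gain(xs, ys, t)
print(t, gain)
```

Unlike quantile binning, the chosen threshold here depends on the target labels `ys`, which is the property the text relies on when it distinguishes target-dependent splitting from equal-frequency binning.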

Although practical implementations impose explicit constraints, such as these stopping criteria, the resulting partition is not selected from a predefined catalog of admissible structures (see footnote 5). Instead, the input-output mapping, i.e., the exact number, location, and hierarchy of splits, is functionally constructed from the data via an inductive procedure. It is thus considered Level 3a.

Footnote 5: If one can enumerate the possible splitting points in advance and the number of splitting points is fixed, then the capability to infer might be reduced to Level 2. Consider, e.g., the interval $[0,100]$. If splits are only allowed at natural numbers and the total number of splits is $k$, then there are $\binom{99}{k}$ potential split configurations. Due to the fast growth of this number for $k \to 50$, the bins will effectively be determined by the data, as, for example, the CART algorithm (Breiman et al., [1984](https://arxiv.org/html/2606.11769#bib.bib39)) does not consider all splits but searches for a local minimum. However, for small $k \sim 1, 2$, the system does not fall into Level 3.
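To give a feeling for the growth mentioned in the footnote, the following small sketch evaluates $\binom{99}{k}$ for a few values of $k$. The selection of $k$ values is arbitrary and chosen only for illustration.

```python
from math import comb

# 99 candidate split points (the integers 1..99 inside [0, 100]), k of them chosen.
counts = {k: comb(99, k) for k in (1, 2, 5, 10, 50)}
for k, n_configurations in counts.items():
    print(k, n_configurations)
```

For $k = 1$ or $k = 2$ the catalog of alternatives is small enough to enumerate, whereas near $k = 50$ exhaustive enumeration is out of reach, which matches the footnote's distinction between the two regimes.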

#### 3\.2\.2\.Logistic regression

Logistic regression assumes a fixed parametric form for the conditional probability of the target variable given an input vector $x \in \mathbb{R}^d$:

$$\mathbb{P}(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}. \tag{3}$$

Learning consists of estimating the parameter vector $w \in \mathbb{R}^d$ and bias term $b \in \mathbb{R}$, typically by minimizing a convex loss function such as the negative log-likelihood. The hypothesis space is fixed and finite-dimensional, and data influence only the numerical values of the parameters. In this configuration, logistic regression exemplifies Level 1 (parametric adaptation), as no structural aspects of the decision logic are derived from data.
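A minimal sketch of Equation (3) and of parameter estimation by plain gradient descent on the negative log-likelihood is given below. The optimizer, learning rate, iteration count, and toy data are assumptions made for illustration. This is not the solver used in the paper, which relies on scikit-learn.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict_proba(w, b, x):
    """P(y = 1 | x) = sigma(w^T x + b) for a single input vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def negative_log_likelihood(w, b, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * log(p) + (1 - yi) * log(1 - p)
    return total


def gradient_step(w, b, X, y, lr=0.1):
    """One descent step; only the values of w and b change, never their number."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict_proba(w, b, xi) - yi
        grad_w = [g + err * xij for g, xij in zip(grad_w, xi)]
        grad_b += err
    return [wi - lr * g for wi, g in zip(w, grad_w)], b - lr * grad_b


# Toy data (hypothetical): one feature, four observations
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = [0.0], 0.0
nll_before = negative_log_likelihood(w, b, X, y)
for _ in range(200):
    w, b = gradient_step(w, b, X, y)
nll_after = negative_log_likelihood(w, b, X, y)
print(nll_before, nll_after, w, b)
```

The shape of the model, one weight per input dimension plus a bias, is fixed before any data are seen. Training only moves the numbers inside that shape, which is what the Level 1 classification expresses.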

In practical credit-scoring applications, logistic regression models are often trained using stepwise feature selection (Bursac et al., [2008](https://arxiv.org/html/2606.11769#bib.bib49); Hosmer and Lemeshow, [2000](https://arxiv.org/html/2606.11769#bib.bib50)). In the forward variant, one starts with a model containing only the "best" variable, where "best" is determined using predetermined criteria, such as the Wald test for statistical significance of that variable. Further variables are then added successively until a stopping condition is reached.

While the functional form of the model remains unchanged, data determine which structural alternative—namely, which feature subset—is selected\. Such configurations therefore correspond to Level 2 \(structural selection\), as inference operates through data\-driven choice among predefined structural options rather than through structural construction\.

Taken together, these examples illustrate that credit\-scoring systems frequently combine components operating at different levels of inferential capability\. Analyzing such systems through the proposed framework allows their constituent techniques to be disentangled and supports a more granular assessment of which elements plausibly instantiate the capability to infer as described in Article 3 and Recital 12 of the AI Act\.

## 4\.Capability to infer of creditworthiness models

In order to analyze whether statistical models used for credit scoring have the capability to infer and thus fall under the AI definition of the AI Act, we apply the framework developed in the last section to two examples of such credit scoring models\. We start by recalling how such credit scoring models are developed for typical industrial applications\.

### 4\.1\.Classical scorecard development workflow

Classical credit scorecards remain a widely deployed and regulatorily accepted approach for assessing creditworthiness, particularly in retail lending and consumer finance\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4); Bückeret al\.,[2022](https://arxiv.org/html/2606.11769#bib.bib2)\)\. Despite their apparent simplicity, scorecards are typically developed through a structured, multi\-stage process that combines statistical modeling with domain expertise and governance constraints\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4); Bückeret al\.,[2022](https://arxiv.org/html/2606.11769#bib.bib2)\)\.

The process begins with the definition of the target variable and associated observation and performance windows\. In credit risk applications, this usually entails a binary default indicator, such as a 90\-days\-past\-due event within a fixed horizon\(Hand and Henley,[1997](https://arxiv.org/html/2606.11769#bib.bib13); Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4); Anderson,[2007](https://arxiv.org/html/2606.11769#bib.bib6)\)\. Careful specification of these elements is critical, as it determines both the semantic meaning of the model output and its regulatory validity\.

Following target definition, the raw dataset, often comprising hundreds of candidate variables, is subjected to extensive data cleaning and preprocessing\. This includes consistency checks, treatment of missing values, and outlier handling, with the objective of ensuring robustness and stability rather than maximizing predictive performance\(Finlay,[2012](https://arxiv.org/html/2606.11769#bib.bib5)\)\. At this stage, variables may be removed for purely technical or regulatory reasons, without reference to predictive power\.

A subsequent variable shortlisting phase reduces the dimensionality of the dataset\. This step typically combines domain knowledge, regulatory constraints, and simple univariate statistics such as missing value rates or stability measures\. While quantitative indicators may be used, the shortlisting process in practice often remains partially judgment\-driven\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4)\)\.

Continuous and ordinal variables are then discretized through binning\. Binning serves multiple purposes: it stabilizes variable behavior over time, facilitates interpretability, and prepares the data for transformation into Weight\-of\-Evidence \(WoE\) values \(as defined in the next paragraph\)\. Common approaches include manual binning as well as tree\-based binning methods, with missing values typically treated as separate bins\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4)\)\. Typical loss functions for tree\-based methods are the Gini Impurity or the Shannon entropy and Information Gain, as defined in Section[3\.2\.1](https://arxiv.org/html/2606.11769#S3.SS2.SSS1)\.

A common constraint considered in the binning process is that every feature should have a clear direction of effect, i\.e\., the WoE values for non\-missing values should monotonically increase or monotonically decrease across the bins, without any reversals\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4); Finlay,[2012](https://arxiv.org/html/2606.11769#bib.bib5)\)\. This ensures that only features with comprehensible effects on the default predictions are used in scoring, in the interest of transparency\. Further common constraints are that each bin shall contain a minimum of 5% of instances, and that there shall not be any bins with 0 counts of ’bad’ or ’good’ cases\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4)\)\. Note that while these constraints are not regulatory requirements, they are well\-established best practices\. After binning, further variables may be excluded due to lack of adherence to these constraints or any others previously set for the process\.
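The three constraints named above can be checked mechanically for a candidate binning. The sketch below is hedged in several ways: the function names are hypothetical, the WoE convention (logarithm of the share of goods over the share of bads) is the one common in the scorecard literature and is assumed here because the formal definition sits in Appendix A, and the minimum share is computed relative to the bins passed in, so a separate missing-value bin would need to be handled by the caller.

```python
from math import log


def woe_per_bin(goods, bads):
    """Assumed convention: WoE_k = ln(share of goods in bin k / share of bads in bin k)."""
    total_good, total_bad = sum(goods), sum(bads)
    return [log((g / total_good) / (b / total_bad)) for g, b in zip(goods, bads)]


def check_binning(goods, bads, min_share=0.05):
    """Check the three best-practice constraints for ordered, non-missing bins."""
    total = sum(goods) + sum(bads)
    no_zero_counts = all(g > 0 and b > 0 for g, b in zip(goods, bads))
    min_size_ok = all((g + b) / total >= min_share for g, b in zip(goods, bads))
    monotone = False
    if no_zero_counts:  # WoE is undefined when a bin has no goods or no bads
        woe = woe_per_bin(goods, bads)
        steps = [later - earlier for earlier, later in zip(woe, woe[1:])]
        monotone = all(s > 0 for s in steps) or all(s < 0 for s in steps)
    return {"min_share": min_size_ok, "no_zero_counts": no_zero_counts, "monotone_woe": monotone}


# Toy counts per bin (hypothetical, not taken from the paper's data)
ok = check_binning(goods=[400, 300, 200], bads=[20, 30, 50])
reversal = check_binning(goods=[400, 200, 300], bads=[20, 50, 30])
print(ok)
print(reversal)
```

A binning that fails any of the checks would, following the text, either be revised or lead to exclusion of the variable.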

The binned variables are transformed using the Weight of Evidence \(WoE\) framework\(Hand and Henley,[1997](https://arxiv.org/html/2606.11769#bib.bib13)\), which encodes each bin in terms of its ability to distinguish between non\-defaults and defaults\. This transformation linearizes the relationship between predictors and the log\-odds of default, enabling the effective use of logistic regression while preserving interpretability\.

Univariate measures such as the Information Value \(IV\) are then employed to assess the predictive contribution of individual variables\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4)\)\. Variables with low discriminatory power or poor stability are removed prior to multivariate modeling\.

A typical approach to estimating the core model is using logistic regression on the WoE-transformed variables (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4); Anderson, [2007](https://arxiv.org/html/2606.11769#bib.bib6); Finlay, [2012](https://arxiv.org/html/2606.11769#bib.bib5); Lessmann et al., [2015](https://arxiv.org/html/2606.11769#bib.bib14); Bücker et al., [2022](https://arxiv.org/html/2606.11769#bib.bib2)). Variable selection may be performed using stepwise procedures, subject to constraints on multicollinearity and model complexity. Common metrics for model performance used in variable selection include, among others, the Area Under the Receiver Operating Characteristic curve (AUROC), the Gini coefficient, and the Average Precision score (AP) (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4); Davis and Goadrich, [2006](https://arxiv.org/html/2606.11769#bib.bib15)). Formal definitions of WoE, IV, AUROC, Gini, and AP are provided in Appendix [A](https://arxiv.org/html/2606.11769#A1).

While AUROC and Gini are sensitive to skewed distributions of the target variable, and can produce high scores in spite of poor performance on the minority class, AP is more robust to such imbalances\(Davis and Goadrich,[2006](https://arxiv.org/html/2606.11769#bib.bib15)\)\. In credit scoring, distributions of non\-default and default tend to be highly skewed towards non\-default\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4); Finlay,[2012](https://arxiv.org/html/2606.11769#bib.bib5)\)and thus, indicators that are robust to class imbalance should be considered in both the construction and evaluation of scoring models\. Multicollinearity can be controlled using, for example, the Variance Inflation Factor \(VIF\) to identify highly correlated variables\(Finlay,[2012](https://arxiv.org/html/2606.11769#bib.bib5)\)\.
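The difference between the two metrics under class imbalance can be seen in a small constructed ranking. The sketch below uses the standard pairwise definition of AUROC and the usual "mean precision at the rank of each positive" definition of AP. The paper's formal definitions are in Appendix A and may differ in details such as tie handling, which this sketch ignores by assuming distinct scores. The data are hypothetical.

```python
def auroc(labels_ranked):
    """AUROC from labels sorted by descending score (no tied scores assumed)."""
    n_pos = sum(labels_ranked)
    n_neg = len(labels_ranked) - n_pos
    neg_seen = 0
    wins = 0  # (positive, negative) pairs where the positive is ranked higher
    for y in labels_ranked:
        if y == 1:
            wins += n_neg - neg_seen
        else:
            neg_seen += 1
    return wins / (n_pos * n_neg)


def average_precision(labels_ranked):
    """Mean of precision@rank, evaluated at the rank of every positive."""
    hits, total = 0, 0.0
    for rank, y in enumerate(labels_ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# 100 applicants ranked by predicted default risk; only 2 defaults (label 1),
# placed at ranks 5 and 10 of the ranking.
labels = [0] * 100
labels[4] = 1
labels[9] = 1
roc = auroc(labels)
ap = average_precision(labels)
print(round(roc, 4), round(ap, 4))
```

In this toy ranking most non-defaulters are correctly placed below the defaulters, so AUROC is high, yet eight of the ten highest-risk predictions are wrong, which AP reflects.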

The resulting final model trained on the selected variables is then validated using out\-of\-sample testing and stability metrics to ensure robustness over time\.

Finally, the model is converted into an operational scorecard by defining a score offset and performing points\-to\-double\-the\-odds \(PDO\) score scaling, in which the points associated with the feature bins are scaled such that an increase of a certain number of points in the score corresponds to a doubling of the odds of non\-default\(Siddiqi,[2006](https://arxiv.org/html/2606.11769#bib.bib4)\)\. The completed scorecard is embedded within a governance framework that includes documentation, approval processes, monitoring, and periodic review, in line with regulatory expectations\.

### 4\.2\.Illustration using two credit scoring examples

We now illustrate our framework using two example credit scoring approaches that exhibit different degrees of inference capability. Both examples are built using the Give Me Some Credit dataset from Kaggle (Fusion and Cukierski, [2011](https://arxiv.org/html/2606.11769#bib.bib52)), which contains 150,000 instances with 10 features and the binary target variable of a default to be predicted. A default here means a delinquency of over 90 days past due. Definitions of all features as well as the full resulting scorecards are presented in Appendix [B](https://arxiv.org/html/2606.11769#A2).

Before constructing the two models, we preprocess the data as follows. Missing values for `MoInc` are marked for treatment as a separate bin later on. Neither the target variable nor any of the other features have missing values for any of the instances in the dataset. Furthermore, we exclude the features `Age` and `NumberOfDependents` from scoring for ethical reasons and thus discard them from the data. Lastly, the data is split into 70% training and 30% test data.

The binning procedures differ between the models, but both follow the best-practice constraints stated in Section [4.1](https://arxiv.org/html/2606.11769#S4.SS1): each bin must contain a minimum of 5% of instances, there must not be any bins with 0 counts of default or non-default cases, and we enforce monotonicity of the WoE values across bins for every feature, with the exception of the missing value bin of `MoInc` as well as the 0 values of selected features, based on domain knowledge. In preparation for binning, we thus consider the direction of that monotonicity for every feature. For example, a higher `MoInc` points to a higher probability of on-time payments and thus a lower risk of default, whereas for the feature `N30-59Late`, which counts the number of times that a person has previously had payments 30 to 59 days past due in their credit history, a higher number may indicate a higher risk of future default (see footnote 6).

Footnote 6: Definitions of all features and their respective expected directions of monotonicity, including the rationale behind them, are presented in Appendix [B](https://arxiv.org/html/2606.11769#A2).

Feature selection processes differ strongly between approaches and are discussed individually in the respective following sections\.

The models themselves are both constructed with logistic regression, using the scikit-learn implementation `LogisticRegression` (with the 'liblinear' solver). Lastly, scorecards are constructed using scaling with PDO, setting $20$ as the number of points to double the odds of a non-default outcome and a score offset of $600$ points. The final scorecards for both models are given in Appendix [B](https://arxiv.org/html/2606.11769#A2).
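The PDO scaling can be illustrated with the two values given above. The sketch assumes the common log-odds scaling, score = offset + factor · ln(odds), with factor = PDO / ln 2, and it further assumes that the offset is the score at odds of 1:1, since the text does not state the reference odds. The function name is hypothetical, and the per-bin point allocation used for the actual scorecards is not reproduced here.

```python
from math import log

PDO = 20       # points to double the odds (value stated in the text)
OFFSET = 600   # score offset (value stated in the text)
FACTOR = PDO / log(2)


def score_from_odds(odds_non_default):
    """Assumed scaling: score = OFFSET + FACTOR * ln(odds of non-default)."""
    return OFFSET + FACTOR * log(odds_non_default)


for odds in (1, 2, 4, 8):
    print(odds, round(score_from_odds(odds), 2))
```

Under these assumptions, every doubling of the odds of non-default raises the score by 20 points, independent of the starting odds.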

#### 4\.2\.1\.Example I: Semi\-automated credit scorecard construction

After data preprocessing as outlined above, binning of the feature variables is performed using the `DecisionTreeClassifier` from scikit-learn, which, node by node, determines the split point that best separates default and non-default cases, optimizing for minimal entropy in each bin. The monotonic constraints are passed to the `DecisionTreeClassifier` encoded as values of 1 and -1 for monotonic increase or decrease, respectively, for each feature. Based on these constraints, the algorithm only allows splits where the average probability of default either monotonically increases or monotonically decreases across bins, automatically rejecting splits that would violate monotonicity. Similarly, the 5% rule is passed to the function by setting the `min_samples_leaf` parameter to $0.05$, and the condition of no 0 counts of either default or non-default instances per bin is enforced via an if-condition in the surrounding algorithm. An example of a resulting decision tree is given in Figure [1](https://arxiv.org/html/2606.11769#S4.F1). This method of binning is data-driven while simultaneously being consistent with domain expectations and the given constraints.

![Refer to caption](https://arxiv.org/html/2606.11769v1/x1.png)

Figure 1. Decision tree binning for the feature `N30-59Late`. Node labels show split criteria, (rounded) sample proportions, class distributions in the format `value = [good, bad]`, and entropy values for each bin. Note that this is a simple tree with only three bins for illustration, and more complex trees are constructed in Example I.

Next, we train a logistic regression model to statistically infer the relationship between the WoE-transformed features and the target variable. The selection of features for use in the final scorecard is performed stepwise, iteratively choosing the most predictive features in an automated process. At each step, the algorithm evaluates all remaining features and selects the one that yields the largest improvement in model performance, as measured by the average precision score. If the selected feature improves the score by at least $0.002$, it is added to the model; otherwise, it is discarded and the process stops. At the end of every forward step, a backward check of multicollinearity is performed and any feature with a VIF greater than $10$ is removed from the model (see footnote 7).

Footnote 7: Following a rule of thumb as stated in (Finlay, [2012](https://arxiv.org/html/2606.11769#bib.bib5)).
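The forward part of this selection loop can be sketched generically. In the sketch, `score_fn` is a hypothetical stand-in for "fit a logistic regression on the given features and return its average precision", and the toy scoring function with additive gains is invented purely so that the example runs. The backward VIF check described above is omitted, and none of this is claimed to match the code in the paper's repository.

```python
def forward_select(candidates, score_fn, min_gain=0.002):
    """Greedy forward selection over a predefined list of candidate features.

    score_fn maps a list of feature names to a performance score
    (hypothetical stand-in for model fitting plus evaluation).
    """
    selected = []
    remaining = list(candidates)
    best_score = score_fn(selected)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        new_score, feature = max(scored)
        if new_score - best_score < min_gain:
            break  # the best remaining feature does not help enough: stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = new_score
    return selected, best_score


# Toy scoring function (hypothetical): each feature adds a fixed gain.
GAINS = {"a": 0.10, "b": 0.05, "c": 0.001}


def toy_score(features):
    return 0.2 + sum(GAINS[f] for f in features)


selected, final_score = forward_select(["a", "b", "c"], toy_score)
print(selected, round(final_score, 3))
```

The set of candidate features is fixed in advance and the data only decide which of them enter the model. This is the sense in which the text later grades stepwise selection as Level 2 rather than Level 3.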

The final logistic regression model, trained on the WoE values and ground truths of the selected features, achieves an AP of 0\.3737 and an AUROC of 0\.8517\.

#### 4\.2\.2\.Example II: Manual credit scorecard construction

After data preprocessing, we perform manual binning and feature selection largely following the process described in (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4)). For each feature in the dataset, we first consider equal-frequency binnings and calculate Bad Rates (i.e., percentage of default outcomes) and WoE values per bin, as well as IV per binning. For further manual design of bin split points, these statistics are arranged in easy-to-read tables; an example is given in Table [2](https://arxiv.org/html/2606.11769#S4.T2). Based on this information, we then manually and iteratively define split points to create improved binnings, with the aim of maximizing IV while respecting the aforementioned best-practice conditions. Features that cannot be split into bins with WoE values that adhere to all constraints while simultaneously being intuitively explainable are excluded from further consideration for the score model.

`MoInc` (Monthly Income), IV = 0.0721

| Bin | Count | Bads | Bad Rate | WoE |
|---|---|---|---|---|
| missing value | 20723 | 1136 | 5.48% | 0.21 |
| $[0.0,\,2692.0]$ | 14047 | 1236 | 8.80% | -0.30 |
| $(2692.0,\,4000.0]$ | 14316 | 1306 | 9.12% | -0.34 |
| $(4000.0,\,5382.0]$ | 13778 | 1071 | 7.77% | -0.16 |
| $(5382.0,\,7059.0]$ | 14045 | 898 | 6.39% | 0.05 |
| $(7059.0,\,9900.0]$ | 14092 | 758 | 5.38% | 0.23 |
| $\geq 9900.0$ | 13999 | 613 | 4.38% | 0.45 |

Table 2. Equal-frequency binning overview for feature `MoInc`.

Of the remaining features, all features with an IV of $0.02$ or greater are selected for use in the model. For the final set of selected features, WoE tables are calculated, and the final regression model is trained on the WoE values and the target variable ground truths. This process results in a model with an AP of 0.3602 and AUROC of 0.8502 on the test data.
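As a worked example, the Bad Rate, WoE, and IV columns of Table 2 can be recomputed from the counts. The formal definitions are given in Appendix A and are not reproduced in this section, so the sketch assumes the convention common in the scorecard literature: WoE is the logarithm of a bin's share of goods divided by its share of bads, and IV is the sum over bins of the difference in shares times the WoE. The counts are copied from Table 2. The variable names are illustrative.

```python
from math import log

# (bin label, count, bads) copied from Table 2
rows = [
    ("missing value",    20723, 1136),
    ("[0.0, 2692.0]",    14047, 1236),
    ("(2692.0, 4000.0]", 14316, 1306),
    ("(4000.0, 5382.0]", 13778, 1071),
    ("(5382.0, 7059.0]", 14045, 898),
    ("(7059.0, 9900.0]", 14092, 758),
    (">= 9900.0",        13999, 613),
]

total_bad = sum(bad for _, _, bad in rows)
total_good = sum(n - bad for _, n, bad in rows)

iv = 0.0
woe = {}
for label, n, bad in rows:
    good_share = (n - bad) / total_good
    bad_share = bad / total_bad
    woe[label] = log(good_share / bad_share)        # assumed WoE convention
    iv += (good_share - bad_share) * woe[label]     # assumed IV convention
    print(f"{label:18s} bad rate {bad / n:6.2%}  WoE {woe[label]:+.2f}")
print(f"IV = {iv:.4f}")
```

With an IV threshold of 0.02, as stated above, a feature with this table's IV would pass the selection criterion, provided its final manual binning also satisfies the binning constraints.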

## 5\.Inference levels of the semi\-automated and manual credit scoring approaches

Both scorecard construction approaches share the constraints of the binning, the use of logistic regression to train the final model, and the employed indicators of IV, WoE values, and average precision score\. Yet, they differ fundamentally in the degree to which the construction of the model’s input\-output mapping is driven by the data\.

Data preprocessing is performed identically for both approaches and without any data\-driven construction of model structure, as the treatment of missing values and the exclusion of features for ethical reasons occur manually\.

The first significant conceptual difference lies in the binning process\. In Example I, split points are determined by a decision tree optimizing entropy, subject to the human\-defined constraints of monotonicity and minimum numbers of instances per bin\. Here, the data actively construct the input\-output mapping, corresponding to Level 3a in our framework\. In contrast, Example II employs an iterative manual binning based on domain knowledge and quantitative indicators\. The structure is determined entirely by human reasoning, corresponding to Level 0\.

For Example II, the feature selection is performed manually based on the Information Value of the binned features\. A logistic regression model is trained on these preselected variables, which represents a Level 1 case\. Example I employs an automated stepwise data\-driven process to train a logistic model which, in every step, performs model training and selects the AP maximizing feature without active human intervention\. This constitutes a Level 2 case\.

In summary, the first approach, due to its data\-driven tree\-based binning and the resulting data\-driven construction of the model’s input\-output mapping, is overall classified as Level 3a \(explicit structural construction\) as defined in Table[1](https://arxiv.org/html/2606.11769#S3.T1)\. The second approach is based on manual design decisions throughout, but employs logistic regression for parameter fitting\. Thus, it is classified as Level 1 \(parametric adaptation\)\. Here, we have tacitly introduced the rule that the data processing component with the highest inference level determines the inference level of the overall system\.
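The aggregation rule stated in the last sentence can be written down in a few lines. The component gradings are those given in this section, while the dictionary layout, the numeric ranks (with 3a and 3b sharing a rank), and the function name are assumptions introduced for illustration.

```python
# Numeric rank per level of Table 1; 3a and 3b share a rank (assumption).
LEVEL_RANK = {"0": 0, "1": 1, "2": 2, "3a": 3, "3b": 3, "4": 4}


def system_level(component_levels):
    """Return the level of the component with the highest rank."""
    return max(component_levels.values(), key=LEVEL_RANK.__getitem__)


example_1 = {"binning": "3a", "feature selection and model fitting": "2"}
example_2 = {"binning": "0", "feature selection and model fitting": "1"}

print(system_level(example_1))
print(system_level(example_2))
```

Because the rule takes a maximum, replacing a single high-level component, such as the tree-based binning, is enough to change the grade of the whole workflow.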

### 5\.1\.Resulting scorecards and the impact of human intervention on inference levels

The complete scorecards are found in Appendix [B](https://arxiv.org/html/2606.11769#A2); an excerpt for purposes of comparison and discussion is presented in Table [3](https://arxiv.org/html/2606.11769#S5.T3). For some features, the two approaches produce extremely similar results. For example, for `N30-59Late` (number of minor prior delinquencies between 30 and 59 days), both approaches produce the same effective binnings and corresponding WoEs, with only minor deviations in points as a result of differences in the scorings of other attributes (see Table [3](https://arxiv.org/html/2606.11769#S5.T3)). Results are especially similar for features that only take integer values and thus offer significantly fewer candidate split points that still fulfill all constraints than real-valued features.

| Feature | Example I (semi-automated): Bin | WoE | Pts | Example II (manual): Bin | WoE | Pts |
|---|---|---|---|---|---|---|
| `N30-59Late` | 0 | 0.532 | 8 | 0 | 0.532 | 8 |
| | 1 | -0.882 | -13 | 1 | -0.882 | -14 |
| | $\geq 2$ | -1.895 | -28 | $\geq 2$ | -1.895 | -29 |
| `MoInc` | missing value | 0.211 | 3 | missing value | 0.211 | 2 |
| | [0.0, 3332.5] | -0.355 | -5 | [0.0, 3000.0] | -0.334 | -3 |
| | (3332.5, 3920.5] | -0.246 | -3 | (3000.0, 5000.0] | -0.227 | -2 |
| | (3920.5, 4621.5] | -0.220 | -3 | (5000.0, 7000.0] | 0.009 | 0 |
| | (4621.5, 5563.5] | -0.106 | -1 | (7000.0, 10000.0] | 0.231 | 2 |
| | (5563.5, 6550.5] | 0.002 | 0 | $\geq 10000.0$ | 0.464 | 5 |
| | (6550.5, 7656.5] | 0.190 | 2 | | | |
| | (7656.5, 10284.0] | 0.286 | 4 | | | |
| | $\geq 10284.0$ | 0.474 | 6 | | | |

Table 3. Comparison of scorecards for attributes `N30-59Late` and `MoInc`, with bin split points, WoE values and points rounded.

Comparing the scorecards for the monthly income feature `MoInc` (see Table [3](https://arxiv.org/html/2606.11769#S5.T3)), the `DecisionTreeClassifier` binning created more splits. The second and third non-missing bins map to very similar WoE values and the exact same number of points (due to rounding effects). In practice, a human expert would likely decide to combine these two bins from the Example I scorecard into one. While this constitutes a human intervention, the basic structure of the binning as performed by the decision tree would still persist, and the binning process overall would remain classified as Level 3a in our framework.

However, if a human expert completely overrules the data-derived split points and continues the scorecard design workflow with entirely human-defined binnings, such as adopting the `MoInc` structure from Example II instead of Example I, the significance of the tree-based binning for the overall workflow is lost.

In summary, the degree and type of human intervention determine whether the binning retains its Level 3a classification\. Minor adjustments, such as combining similar bins, merely constitute further processing of the data\-derived structure and preserve the data\-driven nature of the process\. In contrast, complete human overruling of the data\-derived binning, such as setting new split points, destroys the capability to infer\.

### 5\.2\.The need to consider complete workflows

The assessment of the level of inferential capability that a system possesses requires examination of its entire workflow rather than isolated components\. Consider the binning process from Example I in isolation: while it employs automated decision tree binning, this does not constitute an AI system on its own, as it does not produce predictions, decisions, recommendations, or outputs\. Instead, this process is of structure\-defining significance only when integrated into the credit scorecard modeling workflow, where it contributes to determining the ultimate overall inferential capability of the system in the sense of the AI Act\. Vice versa, if one were to assess the capability to infer based solely on the final model — logistic regression — one would arrive at an incorrect result\.

## 6\.Conclusion

This work addresses the classification of the inferential capability of data\-driven models within the meaning of the European AI Act\. The AI Act formulates the capability to infer as a key distinguishing feature for the definition of artificial intelligence\. At the same time, it provides only vague indications of the conditions under which inferential capability exists, making it unclear for certain borderline cases of data\-driven models whether they have sufficient inferential capability to be considered AI and thus regulated by the AI Act\. The framework we have developed builds on the fundamentals of statistical learning theory and defines five levels of inference: fixed mappings without any inference capability \(Level 0\), parametric models \(Level 1\), selection from predefined structural alternatives \(Level 2\), data\-driven structural construction \(Level 3\), and models that learn the relevant features themselves \(Level 4\)\. Our framework cannot determine exactly where the threshold for the capability to infer lies, but it does help to systematize this discussion\. In addition, it ensures comparability when assessing similar borderline cases\. While Level 0 is certainly not AI within the meaning of the AI Act and Levels 3 and 4 have the capability to infer, the classification of Levels 1 and 2 is unclear\. The legally non\-binding Commission Guidelines on the definition of an AI system might suggest that Level 1 should be classified as insufficient capability to infer\.

By applying the framework to a specific use case, credit scoring, we illustrate that the entire data processing workflow used during development must be considered when determining an AI system's capability to infer. Credit scoring represents an interesting use case because, on the one hand, it is listed by Annex III of the AI Act and is thus a high-risk system, provided the criteria as outlined by Article 6 are met. On the other hand, it typically uses techniques whose inference capabilities are unclear a priori. For our analysis, we constructed two realistic credit scorecards based on the Give Me Some Credit dataset, both of which are based on logistic regression. The first uses data-driven methods for binning and feature selection for logistic regression and is classified as Level 3 in our framework. The second is based on manual binning and feature selection and is classified as a parametric model in Level 1. Furthermore, we show that the capability to infer can be lost when human interventions destroy the data-inferred input-output mapping.

## 7\.Limitations

The dataset we use has fewer instances and features than a typical one used in the industry. However, this does not limit our analysis, as it aims to systematically evaluate the inference capability of an end-to-end credit scoring workflow in Section [4.2](https://arxiv.org/html/2606.11769#S4.SS2).

The framework primarily addresses AI systems that are developed in a data\-driven manner\. An extension to symbolic AI systems would be interesting\.

## 8\.Statement of generative AI use

The authors used generative AI tools \(GPT 5, GPT 4, DeepL, Qwen3 Coder\) to set LaTeX tables and as feedback tools for code and text improvement\. Most of the text and code are human\-written, but individual sentences and code cells were generated using these tools\. The authors reviewed and are responsible for all content\.

## 9\.Acknowledgments

This work was supported by the Ministry of Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine\-Westphalia as part of the flagship project ZERTIFIZIERTE KI\.

## References

- R\. Anderson \(2007\)The credit scoring toolkit: theory and practice for retail credit risk management and decision automation\.Oxford University Press,Oxford, UK\.External Links:ISBN 978\-0\-19\-922640\-5,[Link](https://doi.org/10.1093/oso/9780199226405.001.0001),[Document](https://dx.doi.org/10.1093/oso/9780199226405.001.0001)Cited by:[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1)\.
- Association of Consumer Credit Information Suppliers \(2024\)External Links:[Link](https://accis.eu/accis-response-to-consultation-on-european-commissions-guidelines-for-an-ai-system-definition)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p3.1)\.
- Y\. Bengio, A\. Courville, and P\. Vincent \(2013\)Representation learning: a review and new perspectives\.IEEE Transactions on Pattern Analysis and Machine Intelligence35\(8\),pp\. 1798–1828\.External Links:ISSN 0162\-8828,[Link](https://doi.org/10.1109/TPAMI.2013.50),[Document](https://dx.doi.org/10.1109/TPAMI.2013.50)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1),[§3](https://arxiv.org/html/2606.11769#S3.p9.1)\.
- B\. E\. Boser, I\. M\. Guyon, and V\. N\. Vapnik \(1992\)A training algorithm for optimal margin classifiers\.InProceedings of the Fifth Annual Workshop on Computational Learning Theory,COLT ’92,New York, NY, USA,pp\. 144–152\.External Links:ISBN 0\-89791\-497\-X,[Link](https://doi.org/10.1145/130385.130401),[Document](https://dx.doi.org/10.1145/130385.130401)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1)\.
- L\. Breiman, J\. H\. Friedman, R\. A\. Olshen, and C\. J\. Stone \(1984\)Classification and regression trees\.Wadsworth,Belmont, CA, USA\.External Links:ISBN 0\-534\-98053\-8Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1),[footnote 5](https://arxiv.org/html/2606.11769#footnote5)\.
- L\. Breiman \(2001\)Statistical modeling: the two cultures \(with comments and a rejoinder by the author\)\.Statistical Science16\(3\),pp\. 199–231\.External Links:[Link](https://doi.org/10.1214/ss/1009213726),[Document](https://dx.doi.org/10.1214/ss/1009213726)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1),[§3\.1](https://arxiv.org/html/2606.11769#S3.SS1.p1.1)\.
- M\. Bücker, G\. Szepannek, A\. Gosiewska, and P\. Biecek \(2022\)Transparency, auditability, and explainability of machine learning models in credit scoring\.Journal of the Operational Research Society73\(1\),pp\. 70–90\.External Links:[Link](https://doi.org/10.1080/01605682.2021.1922098),[Document](https://dx.doi.org/10.1080/01605682.2021.1922098)Cited by:[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1)\.
- Z\. Bursac, C\. H\. Gauss, D\. K\. Williams, and D\. W\. Hosmer \(2008\)Purposeful selection of variables in logistic regression\.Source Code for Biology and Medicine3,pp\. 17\.Cited by:[§3\.2\.2](https://arxiv.org/html/2606.11769#S3.SS2.SSS2.p2.1)\.
- C\. Buttaboni and L\. Floridi \(2026\)A regulatory taxonomy of AI opacity in the EU: rethinking transparency, traceability, interpretability, and explainability\.AI and Ethics6\.External Links:[Link](https://doi.org/10.1007/s43681-025-00940-0),[Document](https://dx.doi.org/10.1007/s43681-025-00940-0)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx2.p1.1)\.
- G\. Casella and R\. L\. Berger \(2002\)Statistical inference\.2nd edition,Duxbury,Pacific Grove, CA, USA\.External Links:ISBN 0\-534\-24312\-6Cited by:[§3](https://arxiv.org/html/2606.11769#S3.p1.1)\.
- C\. T\. Castán \(2024\)The legal concept of artificial intelligence: the debate surrounding the definition of ai system in the ai act\.BioLaw Journal \- Rivista di BioDiritto,pp\. 305–344\.External Links:[Link](https://doi.org/10.15168/2284-4503-3000),[Document](https://dx.doi.org/10.15168/2284-4503-3000)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- Council of the European Union \(2024\)Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence \(artificial intelligence act\) and amending certain union legislative acts\.Note:ST 14954/22External Links:[Link](https://data.consilium.europa.eu/doc/document/ST-14954-2022-INIT/en/pdf)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p4.1)\.
- T\. M\. Cover and P\. E\. Hart \(1967\)Nearest neighbor pattern classification\.IEEE Transactions on Information Theory13\(1\),pp\. 21–27\.External Links:[Document](https://dx.doi.org/10.1109/TIT.1967.1053964)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1)\.
- J\. Davis and M\. Goadrich \(2006\)The relationship between precision\-recall and roc curves\.InProceedings of the 23rd International Conference on Machine Learning,ICML ’06,New York, NY, USA,pp\. 233–240\.External Links:ISBN 978\-1\-59593\-383\-2,[Link](https://doi.org/10.1145/1143844.1143874),[Document](https://dx.doi.org/10.1145/1143844.1143874)Cited by:[Definition A\.3](https://arxiv.org/html/2606.11769#A1.Thmtheorem3.p1.4),[Definition A\.4](https://arxiv.org/html/2606.11769#A1.Thmtheorem4.p1.3),[Definition A\.5](https://arxiv.org/html/2606.11769#A1.Thmtheorem5.p1.2),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p10.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1)\.
- M\. Ebers, V\. R\. S\. Hoch, F\. Rosenkranz, H\. Ruschemeier, and B\. Steinrötter \(2021\)The european commission’s proposal for an artificial intelligence act—a critical assessment by members of the robotics and ai law society \(rails\)\.J4\(4\),pp\. 589–603\.External Links:ISSN 2571\-8800,[Link](https://doi.org/10.3390/j4040043),[Document](https://dx.doi.org/10.3390/j4040043)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- J\. Ellul \(2022\)Should we regulate artificial intelligence or some uses of software?\.Discover Artificial Intelligence2\.External Links:[Document](https://dx.doi.org/10.1007/s44163-022-00021-9)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- European Commission \(2021\)Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence \(artificial intelligence act\) and amending certain union legislative acts\.Note:COM\(2021\) 206 final, CELEX:52021PC0206External Links:[Link](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p1.1),[§1](https://arxiv.org/html/2606.11769#S1.p4.1)\.
- European Commission \(2024\)External Links:[Link](https://digital-strategy.ec.europa.eu/en/news/commission-launches-consultation-ai-act-prohibitions-and-ai-system-definition)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p3.1)\.
- European Commission \(2025\)Guidelines on the definition of an artificial intelligence system established by regulation \(eu\) 2024/1689 \(ai act\)\.Note:Regulation \(EU\) 2024/1689External Links:[Link](https://digital-strategy.ec.europa.eu/en/library/commission-publishes-guidelines-ai-system-definition-facilitate-first-ai-acts-rules-application)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p3.1),[§3\.1\.5](https://arxiv.org/html/2606.11769#S3.SS1.SSS5.p2.1),[§3\.1\.5](https://arxiv.org/html/2606.11769#S3.SS1.SSS5.p5.1),[§3](https://arxiv.org/html/2606.11769#S3.p3.1)\.
- D\. Fernández\-Llorca, E\. Gómez, I\. Sánchez, and G\. Mazzini \(2025\)An interdisciplinary account of the terminological choices by eu policymakers ahead of the final agreement on the ai act: ai system, general purpose ai system, foundation model, and generative ai\.Artificial Intelligence and Law33\(4\),pp\. 875–888\.External Links:ISSN 1572\-8382,[Link](https://doi.org/10.1007/s10506-024-09412-y),[Document](https://dx.doi.org/10.1007/s10506-024-09412-y)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- S\. Finlay \(2012\)Credit scoring, response modeling, and insurance rating: a practical guide to forecasting consumer behavior\.2nd edition,Palgrave Macmillan,London, UK\.External Links:ISBN 978\-0\-230\-34776\-2,[Link](https://doi.org/10.1057/9781137031693),[Document](https://dx.doi.org/10.1057/9781137031693)Cited by:[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p10.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p6.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1),[footnote 7](https://arxiv.org/html/2606.11769#footnote7)\.
- G\. Finocchiaro \(2024\)The regulation of artificial intelligence\.AI & SOCIETY39,pp\. 1961–1968\.External Links:[Document](https://dx.doi.org/10.1007/s00146-023-01650-z)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- L\. Floridi \(2023\)On the brussels\-washington consensus about the legal definition of artificial intelligence\.Philosophy & Technology36\(4\)\.External Links:ISSN 2210\-5441,[Link](https://doi.org/10.1007/s13347-023-00690-z),[Document](https://dx.doi.org/10.1007/s13347-023-00690-z)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- C\. Fusion and W\. Cukierski \(2011\)Give me some credit\.Note:[https://kaggle\.com/competitions/GiveMeSomeCredit](https://kaggle.com/competitions/GiveMeSomeCredit)KaggleCited by:[Appendix B](https://arxiv.org/html/2606.11769#A2.p1.1),[§4\.2](https://arxiv.org/html/2606.11769#S4.SS2.p1.1)\.
- I\. J\. Goodfellow, D\. Erhan, P\. L\. Carrier, A\. Courville, M\. Mirza, B\. Hamner, W\. Cukierski, Y\. Tang, D\. Thaler, D\. Lee, Y\. Zhou, C\. Ramaiah, F\. Feng, R\. Li, X\. Wang, D\. Athanasakis, J\. Shawe\-Taylor, M\. Milakov, J\. Park, R\. Ionescu, M\. Popescu, C\. Grozea, J\. Bergstra, J\. Xie, L\. Romaszko, B\. Xu, Z\. Chuang, and Y\. Bengio \(2013\)Challenges in representation learning: a report on three machine learning contests\.InNeural Information Processing,M\. Lee, A\. Hirose, Z\. Hou, and R\. M\. Kil \(Eds\.\),Berlin, Heidelberg,pp\. 117–124\.External Links:ISBN 978\-3\-642\-42051\-1,[Link](https://doi.org/10.1007/978-3-642-42051-1_16),[Document](https://dx.doi.org/10.1007/978-3-642-42051-1%5F16)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1),[§3](https://arxiv.org/html/2606.11769#S3.p9.1)\.
- I\. Guyon and A\. Elisseeff \(2003\)An introduction to variable and feature selection\.Journal of Machine Learning Research3,pp\. 1157–1182\.External Links:ISSN 1532\-4435Cited by:[§3](https://arxiv.org/html/2606.11769#S3.p9.1)\.
- P\. Hacker and M\. Eber \(2025\)The future of credit underwriting and insurance under the EU AI act: implications for europe and beyond\.Harvard Data Science Review7\(3\)\.External Links:[Link](https://hdsr.mitpress.mit.edu/pub/19cwd6qx)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx2.p1.1)\.
- P\. Hacker \(2024\)External Links:[Link](https://www.europeannewschool.eu/images/chairs/hacker/Comments%20on%20the%20AI%20Act.pdf)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1),[§3](https://arxiv.org/html/2606.11769#S3.p2.1)\.
- D\. J\. Hand and W\. E\. Henley \(1997\)Statistical classification methods in consumer credit scoring: a review\.Journal of the Royal Statistical Society Series A: Statistics in Society160\(3\),pp\. 523–541\.External Links:ISSN 0964\-1998,[Link](https://doi.org/10.1111/j.1467-985X.1997.00078.x),[Document](https://dx.doi.org/10.1111/j.1467-985X.1997.00078.x)Cited by:[Definition A\.1](https://arxiv.org/html/2606.11769#A1.Thmtheorem1.p1.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p7.1)\.
- T\. Hastie, R\. Tibshirani, and J\. Friedman \(2009\)The elements of statistical learning: data mining, inference, and prediction\.2nd edition,Springer,New York, NY, USA\.External Links:ISBN 978\-0\-387\-84857\-0,[Link](https://doi.org/10.1007/978-0-387-84858-7),[Document](https://dx.doi.org/10.1007/978-0-387-84858-7)Cited by:[§3\.2\.1](https://arxiv.org/html/2606.11769#S3.SS2.SSS1.p2.12)\.
- D\. W\. Hosmer and S\. Lemeshow \(2000\)Applied logistic regression\.John Wiley and Sons\.External Links:ISBN 0471356328, 9780471356325Cited by:[§3\.2\.2](https://arxiv.org/html/2606.11769#S3.SS2.SSS2.p2.1)\.
- S\. Lessmann, B\. Baesens, H\. Seow, and L\. C\. Thomas \(2015\)Benchmarking state\-of\-the\-art classification algorithms for credit scoring: an update of research\.European Journal of Operational Research247\(1\),pp\. 124–136\.External Links:ISSN 0377\-2217,[Link](https://doi.org/10.1016/j.ejor.2015.05.030),[Document](https://dx.doi.org/10.1016/j.ejor.2015.05.030)Cited by:[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1)\.
- T\. M\. Mitchell \(1997\)Machine learning\.McGraw\-Hill,New York, NY, USA\.External Links:ISBN 0\-07\-042807\-7Cited by:[§3](https://arxiv.org/html/2606.11769#S3.p5.1)\.
- M\. L\. Montagnani, M\. Najjar, and A\. Davola \(2024\)The EU regulatory approach\(es\) to AI liability, and its application to the financial services market\.Computer Law & Security Review53\.External Links:ISSN 2212\-473X,[Link](https://doi.org/10.1016/j.clsr.2024.105984),[Document](https://dx.doi.org/10.1016/j.clsr.2024.105984)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx2.p1.1)\.
- OECD \(2024\)Recommendation of the council on artificial intelligence\.Technical reportOrganisation for Economic Co\-operation and Development,Paris, France\.Note:OECD/LEGAL/0449External Links:[Link](https://oecd.ai/en/assets/files/OECD-LEGAL-0449-en.pdf)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p2.1)\.
- Oxford English Dictionary \(2025\)Inference\.External Links:[Link](https://www.oed.com/dictionary/inference_n?tl=true)Cited by:[§3](https://arxiv.org/html/2606.11769#S3.p1.1)\.
- G\. Pavlidis \(2024\)Unlocking the black box: analysing the EU artificial intelligence act’s framework for explainability in AI\.Law, Innovation and Technology16\(1\),pp\. 293–308\.External Links:[Link](https://doi.org/10.1080/17579961.2024.2313795),[Document](https://dx.doi.org/10.1080/17579961.2024.2313795)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx2.p1.1)\.
- M\. Á\. Presno Linera and A\. Meuwese \(2025\)Regulating AI from europe: a joint analysis of the AI act and the framework convention on AI\.The Theory and Practice of Legislation13\(3\),pp\. 292–311\.External Links:[Link](https://doi.org/10.1080/20508840.2025.2492524),[Document](https://dx.doi.org/10.1080/20508840.2025.2492524)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- J\. R\. Quinlan \(1986\)Induction of decision trees\.Machine Learning1\(1\),pp\. 81–106\.External Links:ISSN 1573\-0565,[Link](https://doi.org/10.1007/BF00116251),[Document](https://dx.doi.org/10.1007/BF00116251)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx3.p1.1)\.
- S\. Russell and P\. Norvig \(2021\)Artificial intelligence: a modern approach, global edition\.Pearson Deutschland\.External Links:ISBN 9781292401133Cited by:[§3\.1\.5](https://arxiv.org/html/2606.11769#S3.SS1.SSS5.p5.1)\.
- J\. Schuett \(2023\)Defining the scope of ai regulations\.Law, Innovation and Technology15\(1\),pp\. 60–82\.External Links:[Link](https://doi.org/10.1080/17579961.2023.2184135),[Document](https://dx.doi.org/10.1080/17579961.2023.2184135)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.
- N\. Siddiqi \(2006\)Credit risk scorecards: developing and implementing intelligent credit scoring\.John Wiley & Sons, Inc\.,Hoboken, NJ, USA\.External Links:ISBN 978\-0\-471\-75451\-0Cited by:[Definition A\.2](https://arxiv.org/html/2606.11769#A1.Thmtheorem2.p1.4),[Definition A\.3](https://arxiv.org/html/2606.11769#A1.Thmtheorem3.p1.4),[Definition A\.4](https://arxiv.org/html/2606.11769#A1.Thmtheorem4.p1.3),[Definition A\.5](https://arxiv.org/html/2606.11769#A1.Thmtheorem5.p1.2),[Appendix B](https://arxiv.org/html/2606.11769#A2.p3.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p10.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p12.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p4.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p5.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p6.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p8.1),[§4\.1](https://arxiv.org/html/2606.11769#S4.SS1.p9.1),[§4\.2\.2](https://arxiv.org/html/2606.11769#S4.SS2.SSS2.p1.1)\.
- S\. Singh, A\. Schupbach, A\. Asiala, and D\. A\. Siwecki \(2025\)Note:ECB Supervision NewsletterExternal Links:[Link](https://www.bankingsupervision.europa.eu/press/supervisory-newsletters/newsletter/2025/html/ssm.nl251120_1.en.html)Cited by:[§1](https://arxiv.org/html/2606.11769#S1.p3.1)\.
- G\. Spindler \(2023\)Algorithms, credit scoring, and the new proposals of the EU for an AI act and on a consumer credit directive\.Law and Financial Markets Review15\(3\-4\),pp\. 239–261\.External Links:[Link](https://doi.org/10.1080/17521440.2023.2168940),[Document](https://dx.doi.org/10.1080/17521440.2023.2168940)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx2.p1.1)\.
- V\. N\. Vapnik and A\. Y\. Chervonenkis \(1971\)On the uniform convergence of relative frequencies of events to their probabilities\.Theory of Probability and its ApplicationsXVI\(2\),pp\. 264–280\.Cited by:[footnote 3](https://arxiv.org/html/2606.11769#footnote3)\.
- M\. Veale and F\. Zuiderveen Borgesius \(2021\)Demystifying the draft eu artificial intelligence act – analysing the good, the bad, and the unclear elements of the proposed approach\.Computer Law Review International22\(4\),pp\. 97–112\.External Links:[Link](https://doi.org/10.9785/cri-2021-220402),[Document](https://dx.doi.org/10.9785/cri-2021-220402)Cited by:[§2](https://arxiv.org/html/2606.11769#S2.SS0.SSSx1.p1.1)\.

## Appendix A Definitions for scorecard development

The credit scorecard development workflow laid out in Section [4.1](https://arxiv.org/html/2606.11769#S4.SS1) uses several quantities that encode the relationship between default and non-default data points, either in an individual bin (Weight of Evidence) or across a set of bins (Information Value), as well as metrics for model quality (AUROC, Gini coefficient, Average Precision). They are defined as follows.

###### Definition A\.1 \(Weight of Evidence \(WoE\)\)\.

The WoE framework expresses each bin as the logarithm of the ratio between non-default and default frequencies. For a bin $B$, the WoE is (Hand and Henley, [1997](https://arxiv.org/html/2606.11769#bib.bib13))

$$\mathrm{WoE}(B) = \ln p_0^B - \ln p_1^B, \tag{4}$$

where $p_0^B$ and $p_1^B$ are the proportions of non-default and default outcomes in bin $B$, respectively.
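As a minimal sketch of Equation (4), the following computes the WoE of a single bin from raw counts. It assumes the common scorecard reading in which $p_0^B$ is the share of all non-defaults that fall into $B$ and $p_1^B$ the share of all defaults that fall into $B$; the function name and the counts are hypothetical and not taken from the paper's code.

```python
import math


def weight_of_evidence(good_in_bin, bad_in_bin, good_total, bad_total):
    """WoE(B) = ln p_0^B - ln p_1^B for a single bin B."""
    # Assumed reading: shares of all goods / all bads that land in this bin.
    p0 = good_in_bin / good_total
    p1 = bad_in_bin / bad_total
    return math.log(p0) - math.log(p1)


# Hypothetical portfolio with 9,000 non-defaults and 1,000 defaults overall.
woe_low_risk = weight_of_evidence(4500, 250, 9000, 1000)   # ln(0.5) - ln(0.25)
woe_high_risk = weight_of_evidence(900, 400, 9000, 1000)   # ln(0.1) - ln(0.4)
```

A positive WoE marks a bin in which non-defaults are over-represented relative to defaults, a negative WoE the opposite. Bins with zero goods or zero bads would need smoothing or merging, which this sketch does not handle.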

###### Definition A\.2 \(Information Value \(IV\)\)\.

The IV of a feature $F$ is calculated as

$$\mathrm{IV}(F) = \sum_{B \in \text{Bins}_F} \left(p_0^B - p_1^B\right)\left(\ln p_0^B - \ln p_1^B\right),$$

where $p_0^B$ and $p_1^B$ are the proportions of 'good' (non-default) and 'bad' (default) outcomes in bin $B$, respectively (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4)).
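The IV sum can be sketched in a few lines. As an assumption, $p_0^B$ and $p_1^B$ are again read as the shares of all non-defaults and all defaults that fall into bin $B$; the function name and the per-bin counts are hypothetical.

```python
import math


def information_value(good_counts, bad_counts):
    """IV(F) = sum over bins of (p0 - p1) * (ln p0 - ln p1)."""
    good_total = sum(good_counts)
    bad_total = sum(bad_counts)
    iv = 0.0
    for good, bad in zip(good_counts, bad_counts):
        p0 = good / good_total  # share of all non-defaults in this bin
        p1 = bad / bad_total    # share of all defaults in this bin
        iv += (p0 - p1) * (math.log(p0) - math.log(p1))
    return iv


# Hypothetical feature with three bins.
iv_informative = information_value([4500, 3600, 900], [250, 350, 400])
# A feature whose bins all have the same good/bad mix carries no information.
iv_uninformative = information_value([50, 50], [5, 5])
```

Each summand is non-negative because $p_0^B - p_1^B$ and $\ln p_0^B - \ln p_1^B$ always share the same sign, so the IV is zero only if the good and bad distributions coincide across all bins.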

###### Definition A\.3 \(Area under the Receiver Operating Characteristic Curve \(AUROC\)\)\.

For a model $m$, $\mathrm{AUROC}(m)$ is the area under the curve that plots the True Positive Rate of $m$ against its False Positive Rate, and takes values in $[0, 1]$. A higher AUROC signifies better performance (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4); Davis and Goadrich, [2006](https://arxiv.org/html/2606.11769#bib.bib15)).

###### Definition A\.4 \(Gini coefficient\)\.

The Gini coefficient of a model $m$ is a linear rescaling of the AUROC, $\mathrm{Gini}(m) = 2 \cdot \mathrm{AUROC}(m) - 1$, with values in $[-1, 1]$; a higher value again signifies better performance (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4); Davis and Goadrich, [2006](https://arxiv.org/html/2606.11769#bib.bib15)).
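The two definitions above can be illustrated with a small sketch. It relies on the standard fact that the AUROC equals the probability that a randomly drawn positive instance receives a higher score than a randomly drawn negative one, with ties counted as one half. The function names, the toy labels and scores, and the convention that label 1 marks the positive class are assumptions made for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative instances."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as one half
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini(m) = 2 * AUROC(m) - 1."""
    return 2.0 * auroc(labels, scores) - 1.0


# Toy data: 8 of the 9 positive/negative pairs are ordered correctly.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auroc_value = auroc(labels, scores)
gini_value = gini(labels, scores)
```

The pairwise loop is quadratic in the number of instances and is meant for readability; a rank-based computation would be preferable on a dataset of 150,000 instances.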

###### Definition A\.5 \(Average Precision \(AP\)\)\.

The AP of a model $m$ corresponds to the area under the curve plotting its Precision against its Recall and takes values in $[0, 1]$; the higher the value, the better the performance (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4); Davis and Goadrich, [2006](https://arxiv.org/html/2606.11769#bib.bib15)).
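One common way to estimate this area is the step-wise sum that averages the precision at every rank where a positive instance is retrieved. The sketch below uses that estimator and assumes that label 1 marks the positive class and that no two scores are tied; the function name and the toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise AP: mean of precision@k over the ranks k of positive instances."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy data: positives are retrieved at ranks 1, 2, and 4.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
ap_value = average_precision(labels, scores)
```

For the toy data, the precisions at the three positive ranks are 1/1, 2/2, and 3/4, so the result is their mean.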

Note that the definitions of Shannon entropy and Information Gain, which are commonly used as loss functions for decision tree-based binning methods, are already included in Section [3.2.1](https://arxiv.org/html/2606.11769#S3.SS2.SSS1) and thus are not repeated here.

## Appendix B Credit scorecard examples

The Give Me Some Credit Kaggle competition (Fusion and Cukierski, [2011](https://arxiv.org/html/2606.11769#bib.bib52)) provides separate files for training and test data. Since the latter does not include the ground truth, we use only the training data file and split it. It contains 150,000 instances with 11 features.

For each of the features, a respective monotonicity constraint was applied in our examples in Section [4.2](https://arxiv.org/html/2606.11769#S4.SS2). The definitions of the features and the rationales behind their expected directions of monotonicity are as follows:

- SeriousDlqin2yrs (binary): Person experienced 90 days past due delinquency or worse. (Target variable)
- MonthlyIncome (short name MoInc, real): Monthly income. The higher the income, the more money is available, the lower the risk of default. Missing values treated as separate bin.
- RevolvingUtilizationOfUnsecuredLines (RevUtil, real): The borrower's total balance on credit cards and personal lines of credit, except real estate and installment debt like car loans, divided by the sum of credit limits. The higher the value, the higher the risk of default.
- DebtRatio (real): The borrower's monthly debt payments, alimony, and living costs, divided by monthly gross income. The higher the value, the less income available, thus, the higher the risk of default.
- NumberOfOpenCreditLinesAndLoans (NumLinesLoans, integer): Number of open loans and lines of credit (e.g. credit cards). The more open loans, the higher the risk of default, with value 0 excepted from monotonicity.
- NumberRealEstateLoansOrLines (NumRealEstate, integer): Number of mortgage and real estate loans including home equity lines of credit. The more open loans, the higher the risk of default, with value 0 excepted from monotonicity.
- NumberOfTime30-59DaysPastDueNotWorse (N30-59Late, integer): Number of times borrower has been 30–59 days past due in the last 2 years. The higher the number of previous delays, the higher the risk of default.
- NumberOfTime60-89DaysPastDueNotWorse (N60-89Late, integer): Number of times borrower has been 60–89 days past due in the last 2 years. The higher the number of previous delays, the higher the risk of default.
- NumberOfTimes90DaysLate (N90Late, integer): Number of times borrower has been 90 days or more past due before. The higher the number of previous defaults, the higher the risk of default.
- Age (integer): Age of the borrower in years. Excluded for fairness.
- NumberOfDependents (integer): Number of dependents in family excluding themselves (spouse, children etc.). Excluded for fairness.

Excepting the value 0 from the monotonicity requirement for certain features, as is the case here for NumLinesLoans and NumRealEstate, is common practice in credit scoring. This is because less information and a shorter (if any) repayment history may be available for applicants with no loans or lines, and very low utilization accounts can demonstrate higher risk (Siddiqi, [2006](https://arxiv.org/html/2606.11769#bib.bib4)).
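The effect of this exception can be sketched with a simple monotonicity check over ordered bins. The helper below is hypothetical and assumes the constraint is verified on per-bin WoE values, where a decreasing WoE corresponds to increasing default risk; the paper's binning procedure may enforce the constraint differently. The WoE values are those reported for NumRealEstate in Table 10.

```python
def is_monotone(values, increasing, exempt_first=False):
    """Return True if values are monotone across ordered bins.

    exempt_first drops the first bin (here: the value 0) from the check.
    """
    checked = values[1:] if exempt_first else values
    pairs = list(zip(checked, checked[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


# WoE per bin for NumRealEstate in Example II, bins 0, {1,2}, 3, >=4.
woe_num_real_estate = [-0.235, 0.233, -0.067, -0.582]

# More loans should mean higher risk, i.e. decreasing WoE.
strict_check = is_monotone(woe_num_real_estate, increasing=False)
exempt_check = is_monotone(woe_num_real_estate, increasing=False, exempt_first=True)
```

Without the exception the check fails, because applicants with no real estate loans have a lower WoE than those with one or two; with bin 0 exempted, the remaining bins satisfy the constraint.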

The complete scorecards resulting from Examples I and II are presented in the following tables. Each table covers one feature and, where the feature was selected in both examples, offers a direct comparison of the results of the two approaches.

| Bin | WoE (Ex. I) | Pts (Ex. I) | WoE (Ex. II) | Pts (Ex. II) |
|---|---|---|---|---|
| 0 | 0.396 | 6 | 0.396 | 6 |
| ≥ 1 | -2.302 | -36 | -2.302 | -35 |

Table 4. N90Late

| Bin | WoE (Ex. I) | Pts (Ex. I) | WoE (Ex. II) | Pts (Ex. II) |
|---|---|---|---|---|
| 0 | 0.532 | 8 | 0.532 | 8 |
| 1 | -0.882 | -13 | -0.882 | -14 |
| ≥ 2 | -1.895 | -28 | -1.895 | -29 |

Table 5. N30-59Late

| Bin | WoE (Ex. I) | Pts (Ex. I) | WoE (Ex. II) | Pts (Ex. II) |
|---|---|---|---|---|
| 0 | 0.290 | 3 | 0.290 | 3 |
| ≥ 1 | -2.081 | -24 | -2.081 | -25 |

Table 6. N60-89Late

As can be seen in Tables [4](https://arxiv.org/html/2606.11769#A2.T4), [5](https://arxiv.org/html/2606.11769#A2.T5), and [6](https://arxiv.org/html/2606.11769#A2.T6), both approaches resulted in the same binnings for the features N90Late, N30-59Late, and N60-89Late. This is easily explained by the limited number of possible bin thresholds to choose from, since these are integer-valued features.

| Bin (Ex. I) | WoE (Ex. I) | Pts (Ex. I) | Bin (Ex. II) | WoE (Ex. II) | Pts (Ex. II) |
|---|---|---|---|---|---|
| [0, 0.043] | 1.386 | 27 | [0, 0.1] | 1.336 | 26 |
| (0.043, 0.067] | 1.381 | 27 | (0.1, 0.3] | 0.851 | 16 |
| (0.067, 0.132] | 1.153 | 23 | (0.3, 0.6] | -0.010 | 0 |
| (0.132, 0.184] | 0.932 | 18 | (0.6, 0.9] | -0.774 | -15 |
| (0.184, 0.301] | 0.659 | 13 | ≥ 0.9 | -1.389 | -27 |
| (0.301, 0.396] | 0.234 | 5 | | | |
| (0.396, 0.5] | 0.050 | 1 | | | |
| (0.5, 0.855] | -0.603 | -12 | | | |
| (0.855, 0.989] | -1.190 | -23 | | | |
| ≥ 0.989 | -1.447 | -28 | | | |

Table 7. RevUtil

| Bin (Ex. I) | WoE (Ex. I) | Pts (Ex. I) | Bin (Ex. II) | WoE (Ex. II) | Pts (Ex. II) |
|---|---|---|---|---|---|
| Missing | 0.211 | 3 | Missing | 0.211 | 2 |
| [0, 3332.5] | -0.355 | -5 | [0, 3000] | -0.334 | -3 |
| (3332.5, 3920.5] | -0.246 | -3 | (3000, 5000] | -0.227 | -2 |
| (3920.5, 4621.5] | -0.220 | -3 | (5000, 7000] | 0.009 | 0 |
| (4621.5, 5563.5] | -0.106 | -1 | (7000, 10000] | 0.231 | 2 |
| (5563.5, 6550.5] | 0.002 | 0 | ≥ 10000 | 0.464 | 5 |
| (6550.5, 7656.5] | 0.190 | 2 | | | |
| (7656.5, 10284] | 0.286 | 4 | | | |
| ≥ 10284 | 0.474 | 6 | | | |

Table 8. MoInc

Where more potential split points were available, namely for the real-valued features RevUtil (Table [7](https://arxiv.org/html/2606.11769#A2.T7)) and MoInc (Table [8](https://arxiv.org/html/2606.11769#A2.T8)), the bins differ significantly between the semi-automated (Example I) and manual (Example II) approaches.

Note that, while five of the features selected for the final model are the same across both experiments, the semi-automated approach resulted in DebtRatio being selected as the sixth feature, whereas NumRealEstate was chosen in the manual approach.

| Bin | WoE | Pts |
|---|---|---|
| [0, 0.020] | 0.251 | 8 |
| (0.020, 0.346] | 0.125 | 4 |
| (0.346, 0.423] | 0.048 | 1 |
| (0.423, 0.505] | -0.072 | -2 |
| ≥ 0.505 | -0.162 | -5 |

Table 9. DebtRatio (only Example I)

| Bin | WoE | Pts |
|---|---|---|
| 0 | -0.235 | -3 |
| {1, 2} | 0.233 | 3 |
| 3 | -0.067 | -1 |
| ≥ 4 | -0.582 | -8 |

Table 10. NumRealEstate (only Example II)
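To illustrate how such a scorecard is applied, the following sketch looks up the points of a hypothetical applicant in the Example II scorecard, using the points from Tables 4 to 8 and 10. It assumes that the total is the plain sum of the per-feature points, with any base offset omitted, and that upper bin edges are inclusive, so a value exactly on a boundary such as 0.9 falls into the lower bin. The data structure, function names, and applicant values are made up for illustration.

```python
import bisect
import math

# Points per bin for Example II, transcribed from Tables 4-8 and 10.
# Each feature maps sorted, inclusive upper bin edges to points.
SCORECARD_EX2 = {
    "N90Late": ([0, math.inf], [6, -35]),
    "N30-59Late": ([0, 1, math.inf], [8, -14, -29]),
    "N60-89Late": ([0, math.inf], [3, -25]),
    "RevUtil": ([0.1, 0.3, 0.6, 0.9, math.inf], [26, 16, 0, -15, -27]),
    "MoInc": ([3000, 5000, 7000, 10000, math.inf], [-3, -2, 0, 2, 5]),
    "NumRealEstate": ([0, 2, 3, math.inf], [-3, 3, -1, -8]),
}
# MoInc has a dedicated bin for missing values.
MISSING_POINTS = {"MoInc": 2}


def feature_points(feature, value):
    """Return the points of the bin that contains value."""
    if value is None:
        return MISSING_POINTS[feature]
    edges, points = SCORECARD_EX2[feature]
    # bisect_left finds the first bin whose upper edge is >= value.
    return points[bisect.bisect_left(edges, value)]


def total_points(applicant):
    """Sum the per-feature points (assumption: no base offset)."""
    return sum(feature_points(name, value) for name, value in applicant.items())


# Hypothetical applicant.
applicant = {
    "N90Late": 0,
    "N30-59Late": 1,
    "N60-89Late": 0,
    "RevUtil": 0.25,
    "MoInc": 4200,
    "NumRealEstate": 1,
}
score = total_points(applicant)
score_missing_income = total_points({**applicant, "MoInc": None})
```

Because every feature contributes an additive number of points, the effect of each bin on the total remains directly readable from the tables.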