tl;dr: What does it mean to have a trustworthy model?
Human intelligence is fundamentally characterized by metacognition, our ability to reflect on our own thinking. This capacity for self-reflection has driven humanity's quest to create Artificial Intelligence as the ultimate expression of self-understanding. Yet even with perfect learning and limitless data, AI remains fundamentally retrospective, synthesizing patterns from past events rather than generating the paradigm-shifting innovations, independent curiosity, and defiance of established assumptions that define human cognition. This limitation places the ultimate role of AI not as a replacement for humanity but as an augmentation: a powerful cognitive assistant that frees our time to focus on higher-order challenges like true prospective innovation.
Just as competent assistants acknowledge their limitations and ask clarifying questions when uncertain, AI systems must exhibit similar awareness of their boundaries to earn genuine trust. In safety-critical domains like autonomous vehicles and clinical decision support, misplaced overconfidence in decision-making can cause irreversible harm, and AI trustworthiness becomes both an ethical and functional imperative. Hence, the hallmark of an effective AI assistant is not just its performance but its trustworthiness: its ability to demarcate the boundary between known and unknown, reliably automating routine tasks while identifying novel or ambiguous instances that demand knowledge adaptation or expert oversight.
Defining trustworthiness
An ML model cannot be expected to independently derive the complex and nuanced rules of human morality from historical training data. For example, an AI risk assessment tool used for criminal sentencing could incorrectly flag defendants from over-policed neighborhoods as high-risk, unfairly influencing a judge to impose a longer prison sentence. Or, a hospital using AI purely to optimize patient survival rates could deny a bed to an elderly person in favor of a younger one with marginally better odds, a decision devoid of human moral values like compassion and fairness. These examples reveal a fundamental failure of self-awareness: such systems do not recognize that they are reasoning from incomplete or biased information. A truly trustworthy system would flag these decisions as uncertain, acknowledging that it lacks either representative training data or the ethical context necessary for sound judgment. We should therefore consider a framework for defining trustworthiness that is grounded in principles of both moral and structural soundness. See the figure below; the green-highlighted elements are those that can be addressed by models designed to signal their own limitations, or uncertainty.
Moral and structural soundness of AI systems
Trustworthy decision making
The adoption of AI systems is shadowed by two fundamental and unavoidable truths. First, there is the ambiguity inherent in generalizing to novel real-world scenarios, where the deployment environment may differ from training conditions in subtle but critical ways. Second, there is the ambiguity embedded within an imperfect and non-ideal modeling process, where finite data, computational constraints, and architectural choices introduce unavoidable limitations in what can be learned. Together, these sources of uncertainty create a pressing need for systematic approaches to understanding when and why AI systems may fail. To address both challenges, we must establish a rigorous framework that delineates the necessary steps in the idealized ML deployment pipeline before any model is authorized for decision-making. This framework recognizes that trustworthy AI requires going beyond accurate predictions on test data. Specifically, we identify four critical stages that must be addressed to ensure reliable deployment. The first two stages, acquisition and inference, represent the core operations of any ML system. The latter two stages, alignment and identifiability, introduce critical verification steps that assess whether the system possesses sufficient understanding to make trustworthy predictions.
The decision-making framework for AI systems in production
- Acquisition: The process of ingesting or generating relevant data through sensing, retrieval, or synthesis from physical, archival, or virtual sources.
- Inference: The act of reasoning from the acquired data to arrive at a hypothesis or prediction.
- Alignment: The verification that acquired data statistically conforms to the expected or targeted distribution necessary for reliable inference. In cases of misalignment, we can decide to:
  - Refine the inference model to account for the non-conforming data, about which it currently lacks sufficient insight.
  - Retarget acquisition so that the collected data conforms to what the established inference method expects.
- Identifiability: The determination of whether there is sufficient information to distinguish the true hypothesis from all other possibilities. A notion of non-identifiability can be used to:
  - Enhance the acquisition method's quality to secure more informative data.
  - Respecify the inference method to better utilize available data and resolve ambiguities among the hypotheses.
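The four stages above can be sketched as a simple decision loop. This is a minimal toy illustration, not a production pipeline: the function names (`acquire`, `aligned`, `infer`, `identifiable`, `decide`), the Gaussian data source, and the tolerance thresholds are all illustrative assumptions chosen to make the control flow concrete.

```python
import random

def acquire(n=200, shift=0.0):
    """Acquisition: draw samples from the (possibly shifted) environment."""
    return [random.gauss(shift, 1.0) for _ in range(n)]

def aligned(data, expected_mean=0.0, tol=0.3):
    """Alignment: check the sample mean conforms to the targeted distribution."""
    return abs(sum(data) / len(data) - expected_mean) < tol

def infer(data):
    """Inference: estimate the quantity of interest from the data."""
    return sum(data) / len(data)

def identifiable(data, width_tol=0.2):
    """Identifiability: is the estimate pinned down tightly enough?"""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / (n - 1)
    stderr = (var / n) ** 0.5
    return 1.96 * stderr < width_tol  # rough confidence half-width check

def decide(shift=0.0):
    """Only authorize a prediction once both verification stages pass."""
    data = acquire(shift=shift)
    if not aligned(data):
        return "misaligned: retarget acquisition or refine the model"
    if not identifiable(data):
        return "non-identifiable: enhance acquisition or respecify inference"
    return f"trusted prediction: {infer(data):.2f}"
```

Calling `decide(shift=0.0)` typically returns a trusted prediction, while `decide(shift=2.0)` is flagged as misaligned before any prediction is issued.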
Alignment in machine learning encompasses two critical dimensions: technical alignment, which addresses distribution shifts, data anomalies, and adversarial samples; and value alignment, which mitigates risks such as reward hacking, goal misgeneralization, bias, and inner misalignment. Depending on its nature, misalignment calls for data-driven interventions (retargeting) or model-driven adjustments (refinement). This duality extends to identifiability challenges. Parameter non-identifiability arises when multiple parameter values fit the observed data equally well. With finite samples this may reflect insufficient data, but even with infinite data this form of non-identifiability can persist due to mathematical equivalences in the model structure. The appropriate remedy is to respecify the model, altering its parameterization or imposing constraints to eliminate the ambiguity. In contrast, causal non-identifiability concerns whether the available data, regardless of sample size, contains sufficient information to uniquely determine causal effects. The remedy here is to enhance the data acquisition method, whether through experimental interventions to address confounding, improved instrumentation to reduce measurement error, or richer data collection protocols.
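Parameter non-identifiability from structural equivalence can be seen in a deliberately tiny example. The over-parameterized model `y = a*b*x` is a toy assumption of mine, not drawn from any real system: because only the product `a*b` enters the likelihood, infinitely many `(a, b)` pairs fit the data identically, and no amount of data resolves them. Respecifying the model in terms of the identifiable quantity `c = a*b` removes the ambiguity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 6.0 * x + rng.normal(scale=0.1, size=100)  # true coefficient c = 6

def loss(a, b):
    """Mean squared error of the over-parameterized model y = a*b*x."""
    return float(np.mean((y - a * b * x) ** 2))

# (a, b) = (2, 3) and (1, 6) share the product a*b = 6, so they fit
# the data identically: a structural, not data-limited, ambiguity.
assert loss(2.0, 3.0) == loss(1.0, 6.0)

# Respecification: reparameterize with the identifiable quantity c = a*b,
# which ordinary least squares recovers uniquely.
c_hat = float(np.dot(x, y) / np.dot(x, x))
```

No enlarged dataset would separate `(2, 3)` from `(1, 6)`; only the respecified parameterization is estimable.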
Importantly, even when misalignment or non-identifiability is detected, intervention may not always be warranted or feasible. In such cases, quantifying the degree of misalignment or the extent of non-identifiability becomes essential for transparent decision-making and risk assessment. Addressing these challenges requires bridging the gap between rigorous probabilistic theory and the practical realities of modern deep learning architectures, developing methods that are theoretically grounded yet computationally tractable for real-world deployment. Therefore, most of my research is structured around the proposed decision-making framework for trustworthy AI systems: investigating probabilistic modeling and its relationship to uncertainty quantification, and demonstrating how this information can signal the need for changes in observation or modeling.
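One simple way to turn "degree of misalignment" into a reportable number is a two-sample distance between training and deployment features. The sketch below uses the Kolmogorov–Smirnov statistic as that continuous score; the choice of statistic, the Gaussian data, and the 0.8 shift are my illustrative assumptions, and in practice any calibrated divergence measure could play the same role.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov distance: the maximum gap
    between the two empirical CDFs, a score in [0, 1]."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=1000)          # reference (training) feature
deploy_ok = rng.normal(0.0, 1.0, size=1000)      # deployment data, in distribution
deploy_shift = rng.normal(0.8, 1.0, size=1000)   # deployment data under shift

# Report the score alongside predictions instead of forcing a
# binary intervene/ignore decision.
score_ok = ks_statistic(train, deploy_ok)
score_shift = ks_statistic(train, deploy_shift)
```

The in-distribution score stays near zero while the shifted deployment data scores markedly higher, giving downstream risk assessment a graded signal rather than a hard alarm.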
Closing thoughts
The question remains: how can we build systems capable of this level of reasoning to make sound decisions? The solution lies in treating machine learning probabilistically rather than as a straightforward, deterministic input-output mapping. Using probability theory, we can reason under uncertainty and leverage that uncertainty to drive alignment- and identifiability-related decisions.
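As one concrete instance of this probabilistic treatment, a bootstrap ensemble turns a point predictor into a predictive distribution whose spread can trigger deferral. This is a minimal sketch under assumed toy data (a linear relationship with slope 2); the names `fit_slope` and `predict` are hypothetical, and the same idea applies to deep ensembles or Bayesian posteriors.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy training data: y = 2x plus noise, observed only on [-1, 1].
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(scale=0.2, size=50)

def fit_slope(xs, ys):
    """Least-squares slope of a through-origin linear model."""
    return float(np.dot(xs, ys) / np.dot(xs, xs))

# Fit an ensemble of models on bootstrap resamples of the same data.
slopes = np.array([
    fit_slope(x[idx], y[idx])
    for idx in (rng.integers(0, len(x), size=len(x)) for _ in range(200))
])

def predict(x_new):
    """Return the ensemble mean prediction and its spread (uncertainty)."""
    preds = slopes * x_new
    return float(preds.mean()), float(preds.std())

mean_in, std_in = predict(0.5)     # inside the training range
mean_out, std_out = predict(10.0)  # far outside it
```

The predictive spread grows with distance from the training region, so a single threshold on `std` suffices to route far-from-training inputs to knowledge adaptation or expert oversight instead of silent automation.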