- The Fenced Forest -
← Back to home

10 mins read
In 2024–2025, I worked with business units in a bank to roll out beta GenAI features into live, high-trust workflows. I saw adoption hesitance, and it wasn't just the technology. It was a quieter human problem that showed up consistently across every function and needed deeper unpacking. Project timelines moved on, so I never had that luxury.
In my own time, I explored the behavioural and theoretical aspects of that outcome. What does trust actually require when the technology is probabilistic but the environment isn't? How can we design systems that earn trust?
The practical account of how we shipped into that environment is covered separately in Shipping to Learn, Not to Impress.
--
Warning: This piece gets into the weeds quickly. I spent too much time and went off the deep end.
When the business units arrived with a wishlist of AI features, we brought the conversation back upstream to understand what they actually needed. We then drafted a framework to de-risk the inevitable throwaway costs and balance user needs against the hype. In essence, we were helping the team stay prudent.
We then interviewed employees across different functions to see where GenAI was already showing up in their daily work. Those conversations shaped two beta use cases directly:
Both use cases went through several rounds of co-creation and paper testing before we proceeded with build and rollout. In the months that followed, we monitored feedback closely.
As it rolled in, one of the consistent patterns I captured pointed to accountability fear. If an AI-assisted decision is later questioned, who answers for it? The model can't. And in a bank, that question is never hypothetical.
Accountability fear wasn't a single concern. It sat across multiple layers at once. Compliance exposure, regulatory scrutiny, customer data sensitivity. Each one serious on its own. Together, they made the question of ownership feel untenable.
Even if users were hypothetically required to use it, every output still needed to be thoroughly checked. So why not just do it themselves?
That pattern surfaced across multiple functions and teams. Training gaps, data constraints, tooling limitations: all of it was real and all of it mattered. But underneath those layers was something harder to resolve through infrastructure alone.
What fuelled this lack of trust?
The markers of discomfort could be grouped into two broad camps, and they point in different directions:
Both fears point to the same problem: the model's probabilistic nature is misaligned with the deterministic expectations that enterprise use cases demand.
Basically, GenAI is built to guess well. Enterprise is built to be right. Those two things are in direct conflict.
Telling users to trust the output more isn't an answer. Neither is better onboarding. What if the problem actually demands a way to make the model's behaviour legible? Surfacing its uncertainty, stress-testing its outputs, and putting a human with the right context at the point where the decision gets made. The question isn't whether to use GenAI. It's whether we can design the conditions under which its outputs are genuinely defensible, and I believe that's where early-stage ‘trust’ is planted.
The Shadow AI contradiction
Worth addressing: Microsoft and LinkedIn's 2024 Work Trend Index ↗ found that 78% of AI users at work bring their own tools through personal accounts. With that in mind, we were deliberate about the mix when recruiting. Some participants had little to no familiarity with GenAI. Others had it quietly folded into their personal routines already.
What both groups shared was the same hesitation once the context shifted. GenAI sits comfortably in casual use, but the moment real consequences attach to the output, trust contracts. That shift in tolerance is probably where the real design problem lives.
The hesitation I observed probably had less to do with not understanding the technology, more with what using it officially actually meant. In a regulated environment, a sanctioned tool means owning what comes out of it. That's a different ask than reaching for a personal tool, and I suspect that gap is what drove the hesitation we kept seeing.
So how do we design for it?
GenAI doesn't reason; it generates the most statistically probable continuation of a prompt. It can be simultaneously fluent and fabricating, which is a liability for anyone producing a defensible recommendation or trying to guarantee outputs are free of bias.
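To make that concrete, here is a toy sketch of what ‘most statistically probable continuation’ means. The vocabulary and probabilities are invented for illustration, not taken from any real model; the point is that the same prompt can legitimately produce different answers on different runs.

```python
import random

# Toy next-token distribution for the prompt "The payment was" --
# probabilities invented for illustration, not from a real model.
next_token_probs = {
    "approved": 0.46,
    "declined": 0.31,
    "flagged": 0.18,
    "reversed": 0.05,
}

def sample_continuation(probs, temperature=1.0):
    """Sample one continuation; lower temperature sharpens the distribution."""
    tokens = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(tokens, weights=weights, k=1)[0]

# The same prompt yields different 'answers' across runs -- exactly the
# property that deterministic enterprise workflows struggle with.
runs = [sample_continuation(next_token_probs) for _ in range(5)]
print(runs)
```

Even at low temperature the model is picking the *likeliest* continuation, not a *verified* one, which is the gap the scaffolding below tries to close.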
The challenge isn't to hide this ‘black box’ nature, but to ask: what if we scaffold it with predictive controls to make outputs low-entropy, explainable, and auditable?
To answer that, I broke the issue down into two parts to be solved.
For this, I looked to the Cybernetic Control Model ↗ and reimagined it to fit a three-stage scaffold.
The idea is to reframe the AI as the processing Engine: fast, generative, pattern-matching at scale. The human is the sovereign Governor with final decision-making authority, responsible for grounding, validating, and contextualising the Engine's outputs. The relationship is hierarchical by design, not because AI can't be capable, but because accountability in enterprise contexts has to sit somewhere legible.
This isn't a new insight dressed up in new language. I modelled it around established ML concepts, with direct reference to Constitutional AI ↗, Self-Refine ↗, and LLM-as-a-Judge ↗, and turned them into a portable, parameterised, context-agnostic prompt wrapped in a scaffold with human-in-the-loop governance.
What's different now is the urgency, and the specificity of what ‘oversight’ needs to look like when the engine is a probabilistic, emergent surprise generator rather than a deterministic software tool. (Update: as of 2026, the EU is already phasing in requirements for human-in-the-loop interactions under the EU Artificial Intelligence Act ↗.)

If you want to see how this runs in practice, I made a companion tool that walks through both modes. There's a manual version that takes you through each stage individually, so you can see what the Engine is doing at each step. There's also a master prompt that automates Stages 1 and 2 entirely. The loop runs until parameters are passed. Stage 3 stays manual by design.

A companion tool to test the scaffolding in your own context.
governor-engine-scaffold.netlify.app/
After Stage 2, the Engine scores its own output against three checks. If any fail, it loops back. All three must pass before the output reaches the Governor.

Three questions had to be asked after every stage. Did it survive a stress test? Has it stopped changing meaningfully? Does it actually make sense given the added context? If all three answers were yes, we moved forward.
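The gating loop above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the three check functions are placeholders of my own naming (in practice each would be a prompt sent back to the model), and the toy Engine is hypothetical, not the companion tool's actual logic.

```python
from typing import Optional

def stress_test_passed(output: str) -> bool:
    # Placeholder: in practice, re-prompt the Engine to attack its own
    # output and see whether the conclusion survives.
    return "unsupported" not in output

def has_converged(output: str, previous: Optional[str]) -> bool:
    # Placeholder: the draft has stopped changing meaningfully.
    return previous is not None and output == previous

def makes_sense_in_context(output: str, context: str) -> bool:
    # Placeholder: the draft actually reflects the added context.
    return context in output

def run_stage(generate, context: str, max_loops: int = 5) -> str:
    """Loop the Engine until all three checks pass, then hand off to the Governor."""
    previous = None
    for _ in range(max_loops):
        output = generate(previous)
        if all([stress_test_passed(output),
                has_converged(output, previous),
                makes_sense_in_context(output, context)]):
            return output  # only now does the Governor see it (Stage 3)
        previous = output
    raise RuntimeError("Engine did not converge; escalate to the Governor")

# Toy Engine that produces an unsupported draft, then stabilises.
drafts = iter([
    "Q3 risk summary with an unsupported leap",
    "Q3 risk summary grounded in policy",
    "Q3 risk summary grounded in policy",
])
result = run_stage(lambda prev: next(drafts), context="Q3 risk")
print(result)
```

Note the failure path: if the loop budget runs out, the output never silently reaches the Governor as ‘done’; it escalates instead, which keeps accountability legible.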
In the companion tool, I also repackaged these instructions into a single, LLM-agnostic master prompt with instructions to loop and test recursively until the passing criteria are met.
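For readers who want the shape of such a prompt without opening the tool, here is a paraphrased, parameterised version. The wording and parameter names (`pass_threshold`, `max_loops`) are my own illustration, not the companion tool's actual text.

```python
# A paraphrased, parameterised master prompt -- wording and parameter
# names are illustrative, not the companion tool's actual text.
MASTER_PROMPT = """\
You are the Engine. For the task below, loop through Stages 1 and 2:
1. Generate a draft answer.
2. Score it against three checks:
   - Stress test: attack your own reasoning; does the conclusion survive?
   - Convergence: has the answer stopped changing meaningfully?
   - Context fit: does it make sense given the context provided?
Repeat until all three checks score at least {pass_threshold}/10,
or you reach {max_loops} loops. Then stop and present the final draft,
the scores, and the loop count to the human Governor for Stage 3 review.

Context: {context}
Task: {task}
"""

prompt = MASTER_PROMPT.format(
    pass_threshold=8,
    max_loops=4,
    context="internal credit policy, Q3 figures",
    task="Draft a risk summary for the credit committee",
)
print(prompt)
```

Because the thresholds are plain parameters, the same wrapper can be dialled up for high-stakes work or relaxed for drafts, which is what makes it LLM-agnostic and context-agnostic.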
Below are several competing frameworks and how they measured against the 3-Stage Process. Note that these aren’t rigorous measurements by any means.

In its attempt to add rigour, the Governor-Engine Scaffold consumes tokens heavily (tested on Google's Gemini and Anthropic's Claude). However, users can adjust the passing parameters to fit whatever level of rigour the use case demands, which in theory gives some control over token use. In fact, an earlier draft of the diagram required Stage 1 and Stage 2 to run their own recursive checks and loop within each stage. Testing and retesting variants of these master prompts nearly pushed me past the 5-hour limit on Claude Pro.
The obvious next problem to solve is efficiency, which I suspect data scientists and LLM engineers are already working on.
As I worked through the fears, scaffolding, and metrics, a clearer picture of what trust actually requires started to form. I believe four aspects will make it more palatable for adoption:
None of these are purely technical problems. All of them are a combination of process, people, and design problems.
I feel the organisations getting the most from GenAI aren't the ones who've found a way to trust the model unconditionally. They're the ones who've designed systems where outputs are always accountable to a human with the context, the authority, and the scaffolding to interrogate them. (Update: there's now a growing industry around this, termed Observability.)
The irony wasn't lost on us. GenAI was being sold everywhere as the great productivity unlock, and there we were, watching employees hesitate at the threshold. The hype had arrived. The trust hadn't.
This pattern isn't new. The Gartner Hype Cycle ↗ has a name for it: every transformative technology ‘crests’ on inflated expectations before the inevitable slide into disillusionment. What follows, for the technologies that survive, is a slower climb built on realistic, hard-won understanding. The Dot-com Bubble followed the same arc, and GenAI is no different.
When the ‘crest’ breaks, the loudest voices quiet down and the more useful ones, the practitioners, the level-headed adopters, the people who've actually tried to get the blimp off the ground, start to be heard. Leadership will finally get their GenAI tool for deep enterprise use cases, but probably only after enough pilots stall.
To me, the probabilistic engine is powerful. Taming it isn't about constraining it. It's about knowing exactly who's holding the reins.
I'll also be the first to admit this is one slice of a much larger problem. Properly solving for trust in enterprise GenAI requires multi-specialist effort from ML engineers, compliance leads, legal, change management, and many more. This piece approaches it from a design and governance angle, which is a limited one by definition. Take it as solutioning from that vantage point, not a complete answer.
That said, this piece is my attempt at exploring whether that answer can be made portable. It draws on established ML concepts and cognitive science, but the intent was never purely theoretical. I wanted to see if the conditions for trust (verification, auditability, human oversight) could be packaged into something any non-technical staff could pick up, regardless of their technical setup. The companion tool is the practical test of that. Whether it holds in your context is the real question.
A small footnote: after publishing this, Anthropic released Skills, .md-based instruction packages, and I see them as scaffolding that operates on similar principles. It seems 2026 is shaping up to be the year the industry stops treating human oversight as a philosophical position and starts treating it as an engineering requirement.
Additional reading
© 2025–2026 Kevyn Leong