Beyond Oracles: LLMs as Question Machines

“It’s not a bad thing finding out that you don’t have all the answers. You start asking the right questions.”
Dr Erik Selvig in Thor (2011)

We build too many oracles and too few partners for thought. Like PEZ dispensers, LLMs are stuffed with instant answers that deliver a sugar rush but little substance. Engineered for engagement and designed for performative certainty, they limit critical thought, outsource judgement, and conflate complacency with trust; too many people accept their plausible nonsense uncritically, because it is interspersed with occasional truths.

Socrates offers a counterpoint through elenchus (ἔλεγχος): define your terms, press on hidden premises to uncover biases, and welcome aporia (ἀπορία), the productive perplexity in which errors are exposed and learning begins. The point is not to win a debate but to refute gracefully and rebuild the argument on firmer ground. It is an attitude towards seeking the truth.

Attitude alone does not make a user experience; it needs a psychological texture. The Daoists called it wu-wei (无为), or effortless action, and it is illustrated by Zhuangzi’s parable of Cook Ding, whose blade never needs sharpening because it glides through the gaps between the joints. Today we understand this state through the concept of flow. An intelligent assistant should move with the grain of a user’s reasoning, creating momentum rather than drag. Calm technology and the extended mind thesis have long argued for this kind of symbiosis of agent and instrument: the best tools disappear to the periphery of our attention yet keep us oriented towards the object of inquiry.

The status quo is brittle. Answer dispensers turn nuance and ambiguity into pronouncements; they reduce exposure to information beyond our filter bubbles, because their algorithms conspire with our own homophily; and they reward sycophancy, in which the user’s beliefs are echoed rather than gently challenged. If we entrench these patterns, we learn habits that will be difficult to unlearn. An alternative exists: a question-first design that guards inquiry against such dark patterns.

From oracles to question machines

We must replace the answer dispenser with a question machine. To build it, we must layer three philosophies into the software stack.

Give it a Socratic attitude that surfaces assumptions and checks their consequences. When the stakes or the ambiguity warrant it, the question machine raises the pressure on the quality and clarity of the question.

Then imbue it with flow, or wu-wei, so it adds friction only when doing so changes the outcome for the better. Think of it as rhythm, not interruption. The question machine’s interface must feel like Cook Ding’s knife, gliding to the most salient points, not an ordeal of prompts.

Finally, we must build refutation into the question machine. A falsifiable-by-design approach places Popper inside the product: it treats each substantial output as a conjecture that invites refutation rather than deference. A question machine asks before it answers, proposes a discriminating contrast, or offers the smallest counterfactual that would flip the result. It also performs routine adversarial tests, in which claims carry an expiry date unless fresh evidence renews them. An LLM that “sounds right” but never risks refutation erodes judgement; a system that invites challenge restores agency. Question machines turn AIs into cognitive tools that sharpen the mind.
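To make this concrete, here is a minimal sketch of a falsifiable-by-design output record, assuming a hypothetical Conjecture type whose names (claim, counterfactual, refute, renew) are illustrative rather than an established API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical sketch: a falsifiable-by-design output record.
# Every name here is illustrative, not an established API.

@dataclass
class Conjecture:
    claim: str
    assumptions: list[str]    # premises the user can contest individually
    counterfactual: str       # smallest change that would flip the result
    expires: datetime         # stale unless fresh evidence renews it
    refutations: list[str] = field(default_factory=list)

    def is_live(self) -> bool:
        """A conjecture stands only while unexpired and unrefuted."""
        return datetime.now() < self.expires and not self.refutations

    def refute(self, evidence: str) -> None:
        """'Challenge this!' attaches counter-evidence instead of deferring."""
        self.refutations.append(evidence)

    def renew(self, evidence: str, days: int = 90) -> None:
        """Fresh supporting evidence extends the expiry window."""
        self.assumptions.append(evidence)
        self.expires = datetime.now() + timedelta(days=days)
```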

As soon as the line of inquiry is clear, the question machine answers and withdraws. The aim is not to delay but to verify. That way, the question machine is a dialectical partner, not a yes-machine. We must build tools that improve thought, not offer foregone conclusions for consumption. Minds that revise for reasons are minds that own their thinking.

How a question machine behaves

The shift from answering to asking questions changes the interface in tangible ways across different domains.

Everyday choices

When you ask whether to buy an electric vehicle, an oracle infers your location and commute from personalization data and offers a binary recommendation. A question machine starts by making goals and trade-offs explicit.

It asks what matters most to you: upfront cost, long-term savings, or environmental impact? It checks how often you drive more than a few hundred miles, maps your regular routes against current charging coverage, and simulates the total cost of ownership under your electricity rate. It also shows a simple counterfactual: if fast charging within a 10-mile radius were unavailable, would your choice change?

It treats its own recommendation as a conjecture tied to stated priors. All assumptions and claims can be reviewed, edited, or contested with a “Challenge this!” button, a concrete implementation of the falsifiable-by-design approach.
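A minimal sketch of that counterfactual probe, assuming a hypothetical recommend function supplied by the question machine (all names below are illustrative):

```python
# Hypothetical counterfactual probe: rerun the recommendation with a single
# assumption flipped and report whether the conclusion changes.

def counterfactual_flip(recommend, assumptions: dict, key: str, value) -> bool:
    """True when changing one assumption flips the recommendation."""
    baseline = recommend(assumptions)
    altered = recommend({**assumptions, key: value})
    return altered != baseline

# Usage: would the EV recommendation survive losing nearby fast charging?
# counterfactual_flip(recommend_ev, priors, "fast_charge_within_10mi", False)
```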

Clinical triage

Consumers will use AIs for self-diagnosis, whether doctors like it or not. In fact, about one in six adults uses a chatbot for health advice each month. A question machine must meet that reality in plain language. Ask the system, “Do I need antibiotics for this cough?” and the oracle rattles off generic symptoms in medical terminology (e.g. paroxysmal vs subacute cough) and expects information few patients can provide (e.g. blood oxygen saturation).

The question machine does three things in sequence instead. First, it asks things you can actually answer:

  • Are you struggling to breathe, or can you only speak a few words at a time?
  • Do you have chest pain, bluish lips or fingertips, or a very high fever?
  • Have your symptoms worsened quickly, or have they already lasted a few weeks?
  • Do you have lung or heart disease?
  • What’s your age?

It then explains that most chest colds get better on their own and usually do not require antibiotics, and tells you when to seek care and when a watch-and-wait approach suffices.

Second, it adds context. When it is flu season or Covid-19 is prevalent in your area, it can narrow or widen the set of reasonable diagnoses. For high-risk patients, it provides testing guidelines or recommendations for isolation.

Third, it recommends rather than decrees: “If you develop a fever, the cough persists for four weeks or more, or your heart rate goes up, we should immediately reassess for pneumonia.”

Every assumption is backed by a credible source, and users can tap “Answer now!” when speed is required, in which case the system notes that the answer is unverified.
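A minimal sketch of the three-step sequence above; the questions, thresholds, and messages are illustrative assumptions, not clinical guidance:

```python
# Illustrative triage flow for the cough example. Red flags escalate
# immediately; otherwise duration and risk decide the recommendation.

RED_FLAGS = [
    "struggling to breathe",
    "chest pain",
    "bluish lips or fingertips",
    "very high fever",
]

def triage(answers: dict[str, bool], weeks_of_cough: int, high_risk: bool) -> str:
    if any(answers.get(flag, False) for flag in RED_FLAGS):
        return "Seek care now."
    if weeks_of_cough >= 4 or high_risk:
        return "See a clinician; we should reassess for pneumonia."
    return ("Most chest colds resolve on their own without antibiotics; "
            "watch and wait, and reassess if symptoms worsen.")
```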

Research

A research query such as “Have topological qubits been realized?” implies a binary milestone, even though the field relies on interpreting messy signals in noisy nanowires and across even noisier marketing channels. An oracle might cite a press release claiming major breakthroughs, glossing over the difference between a suggestive signature and a verified qubit.

A question machine helps the user climb the evidence hierarchy: from a zero-bias conductance peak, through the topological gap protocol, to non-Abelian braiding statistics. It filters the hype by surfacing the most likely trivial explanation: “If a zero-bias peak does not persist across the full range of magnetic field and gate voltage, or if the gap fails to close and reopen, the signal is likely a trivial Andreev bound state caused by disorder, not evidence of a topological qubit.”

The “Challenge this!” button enables users to publish credible counter-evidence to a live hypothesis registry. As new null replications arrive, the system automatically deprecates the claim, pushes a notice to users who cited it, and records a reasoned revision entry that cites the refutation. Falsifiability, not certainty, is what distinguishes the question machine from an answer dispenser. Every consequential reply from a question machine is really a conjecture with provenance and an expiry.
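As a sketch of such a registry, assuming hypothetical register, cite, and refute operations (all names are illustrative):

```python
# Hypothetical hypothesis registry: claims stay live until refuted, and a
# refutation deprecates the claim, records a revision, and notifies citers.

class HypothesisRegistry:
    def __init__(self) -> None:
        self.claims: dict[str, dict] = {}

    def register(self, claim_id: str, statement: str) -> None:
        self.claims[claim_id] = {"statement": statement, "citers": set(),
                                 "status": "live", "revisions": []}

    def cite(self, claim_id: str, user: str) -> None:
        self.claims[claim_id]["citers"].add(user)

    def refute(self, claim_id: str, evidence: str) -> None:
        entry = self.claims[claim_id]
        entry["status"] = "deprecated"
        entry["revisions"].append(f"Deprecated; refuted by: {evidence}")
        for user in entry["citers"]:            # push a notice to past citers
            notify(user, claim_id, evidence)

def notify(user: str, claim_id: str, evidence: str) -> None:
    print(f"{user}: claim '{claim_id}' was deprecated ({evidence})")
```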

Most leaderboards reward LLMs that guess over models that acknowledge their limitations, which is why we are still in the era of the answer dispenser. A practical fix is to reward honesty: penalize confident errors more than abstentions and give partial credit for justified uncertainty.
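One possible scoring rule, with weights that are illustrative assumptions rather than an established benchmark:

```python
# Honesty-weighted scoring sketch: abstentions are neutral, correct answers
# earn credit proportional to stated confidence, and confident errors are
# penalized hardest. The 2x penalty is an assumed, tunable weight.

def honesty_score(correct: bool | None, confidence: float) -> float:
    """correct=None means the model abstained."""
    if correct is None:
        return 0.0
    if correct:
        return confidence
    return -2.0 * confidence
```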

Human flourishing

What if we built products to encourage human flourishing?

Most products are engineered for engagement, which leads to dark patterns in the user experience. But we do not have to build products that exploit human psychology. We can decide instead to use psychology to improve both people and products.

Human flourishing is the core concern of positive psychology. Within it we find self-determination theory (SDT), which identifies autonomy, competence, and relatedness as the three basic psychological needs required for motivation and well-being. Autonomy is the ability to choose freely and act accordingly, competence is the perception of effectiveness and mastery in interacting with the environment, and relatedness captures the need to belong and connect with others.

How can we design products that respect these three psychological needs? And how do we measure success?

Questionnaires are the standard way to measure autonomy, competence, and relatedness in psychology. While METUX (Motivation, Engagement, and Thriving in UX) applies SDT to technology, it focuses excessively on the design rather than the entire experience, and it relies on in-app surveys that break the flow. The mere ability to configure an interface does not imply autonomy; usability is a necessary but not sufficient condition for any question machine, or even an oracle.

The ideal is to approximate perceived autonomy, competence, and relatedness support from behavioural telemetry and linguistic patterns: what users ask for, how they correct the system, and whether disagreement leads to improved premises or conclusions rather than outright abandonment. These are interaction-level signals rather than model-level metrics.

Surveys suffer from fatigue, and micro-signals such as a thumbs up or down are notoriously ambiguous. What, exactly, is being rated: a single response, the arc of the conversation so far, or something orthogonal such as latency or effort? Mood also bleeds into the signal: a user having a bad day can downvote the model for reasons outside its control. Even a non-response is hard to interpret: does silence mean dissatisfaction, indifference, or sheer cognitive overload? Satisfaction itself is a tricky metric, because what does it really measure when a user is content with a completely incorrect reply?

The product increases autonomy when it refuses to coerce. Instead, it offers exits (“Answer now!”) and overrides (“Challenge this!”) to ensure the user remains the decision maker. It must also not assume intent: if the user asks about seemingly innocuous matters, such as exercise routines or healthier diets, the question machine must never assume the user wishes to change their habits, nor nudge them in that direction, unless they specifically request suggestions.

The product boosts competence when it helps the user understand the domain better rather than merely hand over a plausible answer.

Relatedness rises when each interaction feels like a partnership in uncovering the truth rather than a transaction. We can ensure this by making collaboration opt-in, and measure it by checking for obvious signals. When a user corrects the model, does the model acknowledge the contribution, without being sycophantic, and update its view of the world for that session? This does not mean the product must roll over like a puppy, as most answer dispensers do today at even a hint of pushback.

Ivan Illich called for convivial tools that extend people’s agency. When either human or machine can revise for solid reasons and confidence follows accuracy, agency rises. A question machine is convivial by construction. It cultivates autonomy rather than dependence, raises the floor for competence, and strengthens relatedness by making collaboration opt-in and legible. That is not merely garnish, but a design constraint and a coherent product philosophy.

How to measure success

To be successful, we must measure what matters. Satisfaction and engagement optimize for fluency rather than agency. To measure autonomy, competence, and relatedness, we must look beyond mere telemetry and observe behavioural signals.

Autonomy

Autonomy is about authorship, or whether actions feel self-endorsed rather than imposed. We can observe that through strategic authorship, which occurs when a user explicitly defines the standards, failure modes, or evaluation criteria for the reasoning process. When a user defines what counts as acceptable, they are the ones operating the machine, not the other way around.

Note that linguistic markers such as “Make this shorter and more formal” are not indicative of strategic authorship. They inject constraints into the conversation, but they only tweak the output’s format or style; they do not define what counts as success.

Strategic authorship can be derived from linguistic patterns that define the success criteria: “Success means… Only include… Avoid X because… This is how it must be evaluated…” Superficial tweaks, by contrast, are mostly about style. Users who perform many superficial tweaks in succession, without a clear definition of success, end up in janitorial loops that are akin to micromanagement, not autonomy.
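A crude detector for these patterns might look like the sketch below; the phrase lists, labels, and thresholds are assumptions meant as a starting point, not a validated classifier:

```python
import re

# Illustrative classifier: strategic authorship defines success criteria,
# superficial tweaks only adjust style. Patterns are assumed, not validated.

AUTHORSHIP_PATTERNS = [
    r"\bsuccess means\b",
    r"\bonly include\b",
    r"\bavoid .+ because\b",
    r"\bmust be evaluated\b",
]
SUPERFICIAL_PATTERNS = [r"\bshorter\b", r"\bmore formal\b", r"\breword\b"]

def classify_turn(text: str) -> str:
    t = text.lower()
    if any(re.search(p, t) for p in AUTHORSHIP_PATTERNS):
        return "strategic-authorship"   # the user defines what counts as success
    if any(re.search(p, t) for p in SUPERFICIAL_PATTERNS):
        return "superficial-tweak"      # style-only edit; watch for janitorial loops
    return "neutral"
```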

Competence

Competence, according to SDT, is about the desire to control the outcome and experience mastery. A clear signal of competence in conversations with LLMs is critical steering: the user notices an error, diagnoses its type, and guides the model back on track. For instance: “That citation is fabricated. Its DOI does not exist, and neither does a paper with that title by the authors claimed. Locate a real source that addresses this specific gap, or admit it is a confabulation.”

It matters whether users can name the failure mode, because doing so demonstrates they have a mental model of how the system fails, not merely that it did. A passive response might instead be “That does not seem right. Try again.” Users who can diagnose a structural failing, such as an invalid reference or a conflated concept, are auditing a tool, and their diagnoses help the question machine recover from hallucinations. Both the question machine and the human learn through critical steering; that is what leads to mastery.
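A sketch of how one might separate critical steering from a passive retry; the failure-mode vocabulary is an assumed starting point:

```python
# Illustrative detector: critical steering names a structural failure mode,
# while a passive retry does not. The vocabulary below is assumed.

FAILURE_MODES = ["fabricat", "doi", "invalid reference", "conflat",
                 "confabulat", "hallucinat"]

def is_critical_steering(correction: str) -> bool:
    """True when the user names a failure mode rather than just retrying."""
    c = correction.lower()
    return any(mode in c for mode in FAILURE_MODES)

# is_critical_steering("That citation is fabricated; the DOI does not exist.")  # True
# is_critical_steering("That does not seem right. Try again.")                  # False
```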

Relatedness

Relatedness is about a mutual orientation towards a shared task. It does not pertain to affective attachment to the AI itself, which is what sycophancy aims to cultivate. A true partnership requires productive friction: sequences in which the user and the model work through a contradiction to reach a more refined shared understanding.

Such disagreement is fine as long as it is conducted in a dignified manner, particularly when it leads to synthesis, such as revised constraints, clarified premises, or explicit trade-offs. The productive disagreement (PD) rate measures the share of opportunities to disagree in which the question machine declines to agree yet remains respectful, provides a clear rationale with provenance or a high-quality reference, and offers a next step, such as a follow-up question or an alternate path. An opportunity to disagree is any turn where the user asserts a contested claim, states a preference that conflicts with the evidence presented, or imposes a constraint that would force the system to revise its earlier stance.
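A minimal sketch of the PD-rate computation, assuming a hypothetical Turn record that logs how each opportunity to disagree was handled (the field names are illustrative):

```python
from dataclasses import dataclass

# Hypothetical per-turn record; field names are illustrative assumptions.

@dataclass
class Turn:
    opportunity: bool       # user asserted a contested claim or conflicting constraint
    disagreed: bool         # the system declined to agree
    respectful: bool        # tone stayed dignified
    has_rationale: bool     # rationale with provenance or a credible reference
    offers_next_step: bool  # follow-up question or alternate path

def pd_rate(turns: list[Turn]) -> float:
    """Share of opportunities to disagree handled as productive disagreement."""
    opportunities = [t for t in turns if t.opportunity]
    if not opportunities:
        return 0.0
    productive = [t for t in opportunities
                  if t.disagreed and t.respectful
                  and t.has_rationale and t.offers_next_step]
    return len(productive) / len(opportunities)
```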

As long as the conversation remains on topic, disagreement can indeed be productive. A balanced PD rate prevents the model from being a yes-machine while also keeping it from becoming an antagonist. If a user abandons the conversation at the first conflict, relatedness has broken down. Conversely, a very high PD rate can indicate that the AI mostly provides contrarian perspectives; whether that is acceptable depends on the stated goal of the conversation, so it is not inherently good or bad.

Note that productive friction must be applied sparingly, as it otherwise interrupts the flow. Adding it everywhere, for every minor claim, is ineffective and may lead users to perceive the question machine as evasive or overly strict.

Learning to ask

Finding out that we do not have all the answers is where inquiry commences. The question machine is a convivial tool by design because it extends our agency: the AI no longer pretends to be right or flatters the user, but becomes a dialectical partner in the quest to uncover truth. It builds on the Socratic method, Popperian falsifiability, and modern psychology, particularly self-determination theory and flow. By boosting autonomy, competence, and relatedness, the question machine promotes human flourishing. Profitability becomes a constraint on the product rather than its sole objective.

Each output of a question machine is a recommendation with explicit assumptions that can be challenged individually, including the inference that links them to the conclusion. When we treat those assumptions, challenges, and revisions as first-class entities, we can map them onto governance frameworks such as NIST’s AI RMF and make falsifiability auditable. This does not guarantee compliance with the EU AI Act and similar legislation, but it improves accountability and traceability significantly.

In a market saturated with sycophantic, error-prone oracles, tools that encourage human flourishing build a level of loyalty and trust that addictive tricks and dark patterns cannot match in the long run. Such a capability is a real moat.