A Dojo for Human Interactions

What does it take to build an AI rehearsal system where the measure of success is that people stop using it?

Better Half describes itself as building “a rehearsal system where AI plays the other person authentically.” It is a space for practising high-stakes human interactions before they happen in real life. The company’s public materials offer two possible readings of what it is trying to accomplish. Either people show up better in real-life relationships or people feel better about their human relationships. While they sound similar, they are not. The former demands behavioural transfer from AI practice to human encounters, which is extremely hard to prove and even harder to monetize. The latter requires emotional regulation and preparedness, which is much more tractable but far less differentiated.

What follows is a thought experiment, in which I have taken the company’s most ambitious interpretation and followed it to its logical conclusions, not because it is the more likely or the correct reading, but because it produces the toughest constraints and the most illuminating product trade-offs. The gentler reading is explored briefly at the end. The sources for this product analysis are Better Half’s position paper (email required) entitled Beyond Sycophancy—Relational Homeostasis in Conversational AI, their initial blog post, and the company website as of March 2026.

The short version

Better Half’s blog post claims that “AI alignment is a $200B+ product problem, not a research question.” If the objective is truly human flourishing and not engagement maximization, the product needs novel language, in particular two concepts I call outgrowth (when the user has practised a scenario sufficiently, internalized what they needed, and moved on) and re-entry (when the user returns later for a different scenario, because life demands it).

Reading through the company’s materials, I see its viability resting on four hypotheses and one constraint:

  • H1: Within-domain transfer works. Does practising with an AI lead to measurable improvement in human interactions over time? If not, the company fails.
  • H2: People pay for AI rehearsal. Are consumers willing to pay for long-horizon relational improvements? If they are not, consumer is not a viable initial market.
  • H3: Human flourishing can be defined operationally. Can the system optimize for human flourishing without drifting towards reassurance or gamification? If it cannot, defensibility disappears, as the product morphs into yet another chatbot.
  • H4: Cross-domain transfer works. Does practice in one context (personal relationships) transfer to another (e.g. negotiation or leadership)? This hypothesis is only required if enterprise is needed for scale. If it fails, market expansion is implausible.
  • C: Thriving over engagement. This is a constraint I would impose to ensure human flourishing remains clearly in focus. If success is defined as continued use, the product will drift towards gamification that optimizes for engagement, as most therapy and companionship chatbots do. Subscriptions optimize for retention, whereas thriving requires less usage per scenario over time.

A product leader might reduce overall risk by sequencing the work to test these hypotheses directly as follows:

  • 30 days (H2): Set up a paid landing page with a fixed set of realistic scenarios backed by prompt-based rehearsal, not the full system described in the position paper. The goal is not to show off the full set of capabilities, but to answer the questions of whether people understand rehearsal and whether they are willing to pay for it.
  • 60 days (H1): Recruit a group of volunteers who commit to real-life conversations, instrumented before and after rehearsal. Afterwards, follow up to compare the effects of rehearsal with whatever prototype exists against scripted prompts and static advice pages. If the outcomes are similar, the entire system is premature in its aims. This is, however, only a directional signal for the hypothesis, not a definitive one.
  • 90 days (H1): Hire psychologists and coaches to act as the system on the same scenarios to compare these expert-led rehearsals against prompt-based rehearsal. The goal is not only to figure out whether the system has any advantage over prompt-based ones, but also to generate proprietary, expert-annotated transcripts for the data flywheel.
  • 120 days (H3): Use self-determination theory to define human flourishing metrics. This is to guard against product drift and to treat outgrowth and re-entry as product success, not failure.

The rest of this article expands on each of these, examines the market reality, and works through the risks. An alternative interpretation that relaxes the transfer requirements (H1 and H4) is explored at the end.

A tricky balance

There are two value systems the product could adopt, and they lead to very different outcomes. What I read as the implied values, drawn from Better Half’s own language, place human flourishing as the objective, commercial sustainability as the constraint, and customer autonomy (i.e. departure) due to increased competence in human relationships (outgrowth) as the success condition. The default, which is how most consumer tech operates, puts profitability as the objective, “don’t do harm” as the constraint, and consumer departure due to lack of value (churn) as a failure condition.

The central question becomes: How do we measure thriving so that outgrowth is seen as a success?

If the company gravitates towards the default, or if my reading is too ambitious, then growth and engagement naturally dominate. Consequently, sparring becomes companionship and thriving turns into retention. An analogy from medicine illustrates the difference: curing helps people “outgrow” a disease permanently, whereas chronic care improves revenue while merely making the disease tolerable. Relational difficulties are, however, often episodic, not chronic. Nobody negotiates salary every day and nobody comes out to their parents twice. These are conversations that happen once, maybe a handful of times, and the rehearsal prior to them needs to match that dynamic.

The axiom decides the metrics, incentives, pricing, and long-term product behaviour. It also decides what wins in case of clashes. This is worth stating explicitly, because most companies discover their axiom by accident, usually when it is too late to change.

A new product language

Better Half lets people rehearse important human interactions so they can thrive in real life. If the objective is truly human flourishing, and not merely engagement dressed up in therapeutic kit, then the product needs novel vocabulary.

Consider what human flourishing requires: people become more competent and thus less dependent over time as their relationships improve—autonomy, competence, and relatedness are the core of self-determination theory, to which I shall return momentarily. I call this behaviour outgrowth: through practice a person has internalized valuable lessons about a particular scenario and grown out of it. People have many relationships and life is episodic, so new crises arrive unbidden, which is why they may return for a different scenario entirely. This is what I call re-entry. The rhythm is not the sticky DAU loop that venture capital adores. It is much closer to how people use tax software or physical rehabilitation: intensely, temporarily, and only when it matters. Most software defines success as retention. A rehearsal system must define success as independence. That pattern is rare in consumer tech, which is why new product language is needed.

Thriving is not an optional moral layer bolted onto a growth engine. It is the only thing that differentiates Better Half from AI companions and therapy bots. Strip it away and the product becomes a chatbot with a personality, a market that is already crowded and trending towards commodity.

That is also what makes it a unique and exciting concept: a product that defines success as the customer’s independence requires a discipline that most companies never dare attempt. If Better Half can pull it off, it will have invented not just a product but a new category of relationship between software and the people who use it.

The hypotheses, revisited and unabridged

Let’s expand upon the core hypotheses, why each matters, and what breaks if it is false.

H1: Within-domain transfer works (existential)

Practising important conversations with an AI leads to measurable improvement in real human interactions over time. If this hypothesis is false, the company fails. The position paper acknowledges as much:

“This requires longitudinal studies comparing rehearsal users against controls.” (Position paper, p. 10)

Such studies may take months or even years. In the meantime, the company must survive on directional evidence—and conviction.

H2: People pay for AI rehearsal (viability)

Consumers value long-horizon relational improvements enough to pay for rehearsal, even if success eventually means they use the product less. If this turns out to be false, consumer is not a viable first market.

The product has a paradox built into its core: the better it works, the less people need it.

The notion that the many relationships in people’s lives will keep users returning with novel scenarios presumes a dependence that is the very opposite of human flourishing: if you have to consult the AI for every relationship issue, you are no longer living your life but living through a proxy, a pattern known as social offloading. Autonomy, as self-determination theory reminds us, is essential to human flourishing.

H3: Thriving can be defined operationally (differentiation)

Better Half can optimize against stable objectives, such as autonomy, competence, and relatedness, without sliding towards gamification and reassurance. If this hypothesis is false, the flywheel breaks down. Without an operational definition of human flourishing, every optimization loop will quietly revert to engagement maximization, and overall defensibility disappears.

H4: Cross-domain transfer works (scaling)

Practice in one context, such as personal relationships, produces skills that transfer to another context, such as negotiation. This hypothesis is only required if enterprise adoption is necessary for viability or scale. If it is false, market expansion from consumer into enterprise and government becomes implausible. I shall return to this later, because the literature on transfer is not kind to this assumption.

C: Thriving over engagement

Human flourishing is the primary objective, which means that growth is subordinate. If success is defined as continued use, the product will drift. Through achievements, streaks, and progress bars, gamification tends to optimize for engagement, not thriving. Subscriptions optimize for retention, whereas thriving requires less usage per scenario over time, even as users shift to different scenarios.

The business model must treat outgrowth as progress, not churn, and enable re-entry without friction. I believe outgrowth by scenario is the success metric. Customers may need the product less over time, and that ought to be celebrated.

The existential question: transfer

“Transfer validity is the ultimate question: does practising with AI actually help in real conversations?” (Position paper, p. 10)

Let’s assume it does. In that case, there are three obvious questions we must answer: First, who is willing to pay for long-horizon improvements when feedback loops are measured in months or more? Second, can the company survive regulatory and liability scrutiny at scale, given that it is operating in the vicinity of mental health? Third, if scale requires enterprise, can we be at all confident that cross-domain transfer actually works?

If AI-to-IRL transfer does not happen, the company fails. There is no pivot that rescues a rehearsal product that does not improve real-life outcomes. If a theatre group or band rehearses every day and yet after a year is nowhere near a public performance of any duration, it must admit failure and disband.

What Better Half is—and is not

Architecturally, Better Half is orchestration rather than end-to-end machine learning: a frozen base LLM (Mistral), LoRA adapters, relational state variables, and a policy/orchestration layer. It is not a companion or therapist. I think of it as a dojo for human interactions: a place where skill transfer matters more than comfort, where practice matters more than attachment, and where the consumer market is merely the proving ground and data collection engine for enterprise and government:

“Create the training data that doesn’t exist.” (Anastasia Uglova, CEO)

A dojo implies instruction and disciplined repetition under the supervision of an AI. This view has consequences that ripple through every product decision. It makes enterprise rehearsals plausible, because sparring is something organizations already pay for. It also removes robotics as a natural extension, because the value lives in the conversational dynamics, not in embodiment. And it quietly rephrases the consumer market as a means to something larger, not the end in itself.

Back to basics

The position paper is candid about what is missing:

“What’s missing is the integration layer, scenario orchestration, the CPT [Cumulative Prospect Theory] reward model, proactive initiations infrastructure and validation studies demonstrating transfer to real conversations.” (Position paper, p. 10)

The natural instinct is perhaps to build all of this in parallel. A more disciplined approach might ask: How can we strip down the architecture to the bare essentials needed to test transfer, and do that first?

Time, the hardest product problem

The product’s central paradox is articulated clearly:

“[…] short-term reward eroding the capacity to tolerate delay, conflict and repair” (Position paper, pp. 1–2)

Feedback is immediate yet improvement may take months or years. Just look at people who spend years in therapy. So, before any longitudinal evidence exists, what counts as progress?

A thumbs up/down on a session is not necessarily indicative of long-term improvement, or even of the AI’s quality. Users having a bad day, or receiving a response that confirms their biases, can skew the results in ways that are invisible to simple sentiment metrics. In the enterprise, decisions tend to follow quarterly timelines, which is still too long for easy cause-and-effect measurement.

The evidence on the efficacy of current mental health apps is not encouraging. Torous et al. (2018) found that the field is plagued by low engagement and weak outcome data, and later work by Torous et al. (2019) showed that mental health apps often fail in real life precisely because of these issues and a lack of integration with healthcare systems.

Limits to learning

Better Half’s ambition is to serve three markets with the same architecture: consumer, enterprise, and government:

“Primary use cases include […] working through conflict […] negotiating salary […].” (Position paper, p. 2)

“One architecture serves all three markets: consumer proves the engine and generates training data.” (Position paper, p. 9)

Transfer is highly context-dependent and requires extensive practice. What would convince the company that practice in one context, say, personal relationships, transfers to another, say, negotiation?

The literature gives reason for caution. Baldwin & Ford (1988) established that transfer from training to real-world settings within a single domain is already unreliable without ongoing support. Ericsson et al. (1993) showed that deliberate practice requires extensive, domain-specific repetition with expert feedback. Barnett & Ceci (2002) found that “far” transfer across substantially different domains may not exist in any robust sense. Sala & Gobet (2017) noted that cognitive training rarely generalizes beyond the trained task. Even in machine learning, Pan & Yang (2009) showed that if domains are not sufficiently related, transfer learning may harm rather than improve performance.

None of this means cross-domain transfer is impossible, but it does mean it requires careful validation, not assumptions. Role play works best when it is grounded in validated scenarios, human facilitation, and external evaluation. Whether Better Half plans to hire human experts to validate scenarios and evaluate outcomes is an important open question.

The baseline to beat

“Standard LLMs are too agreeable.” (Position paper, p. 1)

True, but a vanilla LLM is not the right baseline. A strong prompt with a well-designed coach or therapist persona and structured reflection is. If a prompted frontier model produces similar outcomes, the entire architecture is unjustified.
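To make the baseline concrete: it is little more than a persona prompt, an in-character loop, and a structured reflection step. A minimal sketch, assuming the OpenAI Python client as a stand-in for any frontier model; the persona and reflection wording are my own illustration, not Better Half’s:

```python
# Baseline rehearsal: a prompted frontier model with persona and reflection.
# Minimal sketch; persona, model name, and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COACH_PERSONA = (
    "You are a rehearsal partner for difficult conversations. Play the other "
    "person realistically: push back, show emotion, do not reassure the user, "
    "and stay in character until asked to reflect."
)

REFLECTION_PROMPT = (
    "Step out of character. In three bullet points: what the user did well, "
    "what escalated the conflict, and one thing to try differently next time."
)

def rehearse(scenario: str, user_turns: list[str]) -> str:
    """Run one rehearsal session followed by a structured reflection."""
    messages = [{"role": "system", "content": f"{COACH_PERSONA}\nScenario: {scenario}"}]
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": REFLECTION_PROMPT})
    return client.chat.completions.create(
        model="gpt-4o", messages=messages
    ).choices[0].message.content
```

If sessions like this match the full architecture on outcomes, the architecture is unjustified; that is precisely why the baseline must be built first.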

Market reality

The $200 billion claim

The blog post claims without evidence that AI alignment is a $200B+ product problem. Worldwide AI spend is projected at $2.5 trillion for 2026, so $200 billion would be roughly 8% of total AI expenditure. For context, global cybersecurity spend is roughly $200 billion, and that market did not scale until regulation and liability mandated it. The $200 billion figure for AI alignment is only plausible if alignment becomes a mandatory enterprise cost, that is, driven by compliance, not aspiration.

The adjacent consumer markets are considerably smaller. By 2030, projected market sizes are $14.4B for online dating, $17.5B for mental health apps, and roughly $1.4B for AI companions (extrapolated from $120M in 2025 revenue at 64% CAGR). The $200B figure requires high-ARPA enterprise customers, not consumer subscriptions.

Pricing

The position paper actually lists a subscription of $29/month. At that price, reaching $200 billion in consumer revenue would require roughly 575 million paying subscribers, or approximately 62% of all adults in high-income countries, of whom there are about 923 million. Consumer revenue is insufficient for a $200 billion market. The consumer market is thus a staging area and data-generation engine for the enterprise and government markets, in which case H4 is existential, not merely about scaling.
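The arithmetic behind these figures, spelled out:

$$
\frac{\$200\,\mathrm{B}}{\$29 \times 12/\mathrm{yr}} \approx 575\,\mathrm{M\ subscribers},
\qquad
\frac{575\,\mathrm{M}}{923\,\mathrm{M\ adults}} \approx 62\%.
$$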

Netflix has more than 325 million paid members with annual revenue near $45 billion. Spotify has 290 million premium subscribers and nearly $20 billion revenue. These are among the largest consumer subscription products ever built, yet neither approaches $200 billion in revenue, not even if we double the monthly price.

Additionally, $29 per month is a luxury price point in many regions, which narrows the addressable population further. The ideal customer profile needs to be resolved early: is it affluent individuals who treat this as a personal-development expense, or is it enterprises who embed it into training budgets?

Product and strategy

Privacy as risk containment

“We prioritize […] privacy-preserving training.” (Blog post)

While that is responsible, users rarely choose products because of privacy. Meta, Google, and TikTok dominate despite serial privacy controversies. Privacy-focused alternatives, such as Mastodon or Kagi, remain niche. Privacy is ultimately risk containment, not an acquisition driver.

The real trade-off is between privacy and the model’s ability to learn. Federated learning reduces data transfer but adds complexity. Differential privacy offers strong guarantees but weakens generalization. Secure aggregation and multi-party computation provide strong guarantees but introduce latency and operational overhead. Per-user retention limits keep data safe but starve the flywheel. Each choice constrains the product differently, and none is free.
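To make one of these trade-offs concrete, here is the differential-privacy option in its simplest form: calibrated noise on an aggregate before it leaves the trust boundary. A minimal sketch; the epsilon value and the metric are illustrative assumptions:

```python
# Differentially private release of one aggregate statistic.
# Minimal sketch of the Laplace mechanism; epsilon and the metric are illustrative.
import numpy as np

def dp_count(true_count: int, epsilon: float = 0.5, sensitivity: float = 1.0) -> float:
    """Release a user count under epsilon-differential privacy.

    One user joining or leaving changes the count by at most `sensitivity`,
    so Laplace noise with scale sensitivity/epsilon hides any individual.
    Lower epsilon means stronger privacy but noisier statistics: this is
    the 'weakens generalization' trade-off in action.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# e.g. how many users completed the "salary negotiation" scenario this week
noisy_count = dp_count(true_count=1342, epsilon=0.5)
```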

Governance

“Every conversation turn [is] logged with [a] prompt, [a] response, [a] state snapshot, [and] session metadata.” (Position paper, p. 38)

With memory, the system can track relational dynamics across sessions, but it also multiplies the privacy governance burden. What is truly opt-in? What is retained, and why? What is deleted, and when? These are questions that are crucial to the product, because the answers shape what the system can learn.

Security is essential

Data on mental health is exceptionally sensitive. A breach here is not an operational issue, but a product failure. If someone’s rehearsal transcripts about repairing a marriage, navigating a breakup, or managing anger leak, the damage is not abstract. Security must be part of the product definition from day one: access controls, breach response planning, threat modelling, insider risk mitigation, and auditability. These are surprisingly absent from the position paper.

Measuring human flourishing

“The system learns what helps humans thrive.” (Blog post)

So, how do we define and measure human thriving, especially when it occurs outside of the product itself?

A candidate framework is self-determination theory (SDT), developed by Ryan & Deci (2000), which identifies three basic psychological needs for motivation and well-being:

  1. Autonomy: the ability to choose freely and act accordingly.
  2. Competence: the perception of effectiveness and mastery in interacting with the environment.
  3. Relatedness: the need to connect with others.

SDT has decades of empirical validation across cultures. It explains motivations without defining life goals. It is descriptive, not prescriptive. The system does not decide what a good life is or ought to be, but it supports the conditions under which people can pursue their own goals. I have previously described how SDT can form the basis for LLMs as question machines rather than oracles.

That said, longitudinal validation is still required. SDT provides the vocabulary and the measurement instruments, but someone still needs to run the studies.
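What might these metrics look like operationally? A minimal sketch, assuming per-user, per-scenario session logs; the 60-day window and the definitions of outgrowth and re-entry are my own illustration, not a validated instrument:

```python
# Outgrowth and re-entry as measurable events derived from session logs.
# Minimal sketch; the window, fields, and definitions are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Session:
    user_id: str
    scenario: str        # e.g. "salary_negotiation"
    started_at: datetime
    completed: bool      # rehearsal plus reflection finished

OUTGROWTH_WINDOW = timedelta(days=60)

def outgrown(sessions: list[Session], now: datetime) -> set[tuple[str, str]]:
    """(user, scenario) pairs whose last session was completed and is older
    than the window: silence after completion counts as success, not churn."""
    last: dict[tuple[str, str], Session] = {}
    for s in sorted(sessions, key=lambda s: s.started_at):
        last[(s.user_id, s.scenario)] = s
    return {k for k, s in last.items()
            if s.completed and now - s.started_at > OUTGROWTH_WINDOW}

def reentries(sessions: list[Session]) -> list[tuple[str, str, str]]:
    """(user, old scenario, new scenario) when a user returns for a different
    scenario after a gap long enough to have outgrown the previous one."""
    by_user: dict[str, list[Session]] = {}
    for s in sorted(sessions, key=lambda s: s.started_at):
        by_user.setdefault(s.user_id, []).append(s)
    return [(u, a.scenario, b.scenario)
            for u, ss in by_user.items()
            for a, b in zip(ss, ss[1:])
            if b.scenario != a.scenario
            and b.started_at - a.started_at > OUTGROWTH_WINDOW]
```

Under this framing, a rising outgrowth rate with healthy re-entry is the success dashboard; a rising session count per scenario is the warning sign.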

Defensibility

Better Half aims to make Cumulative Prospect Theory a core novelty:

“Novel contribution: CPT-aligned decision making.” (Position paper, p. 6)

Unfortunately, the architecture alone is not a moat. Consider the parallel in weather forecasting: frontier model architectures from Google, Microsoft, Huawei, NVIDIA, and startups like Jua are often similar in performance and fiendishly hard to keep secret at the speed and connectivity of the modern ML community. Even data is no longer a moat, unless it is proprietary.

The plausible moats for Better Half are proprietary longitudinal data, which takes years to accumulate, institutional partnerships, which require trust and track record, and rigorous evals, which require the operational definition of thriving from H3. The business model itself, if it embraces the outgrowth-and-re-entry logic I described earlier, may turn out to be the deepest moat, precisely because it is so counter-intuitive that competitors will not copy it. Few companies dare optimize for human flourishing rather than profit.

Analytics from day one

If consumer is the initial market, telemetry must be available from day one without bespoke ETL or a large data team. The recommended pattern is an analytical twin of production systems and log stores for real-time SQL analyses, covering:

  • KPI tracking, including cohort analysis
  • Monitoring of conversation quality and harm signals
  • Detection of performance regressions
  • Correlation of user behaviour with reliability events, such as churn following specific exceptions

Telemetry ought to be minimal and structured. Such data discipline is rare in startups, but in my experience it is much harder to retrofit a real-time data platform than to integrate it from the start. Raw text retention and training use must be governed separately, which ties to the governance questions above.

An analytical twin might consist of service events flowing through Kafka into Pinot or Feldera, with database changes captured via CDC (e.g. Debezium) into the same pipeline. The wire format should be Avro or Protobuf for typed schema evolution that is backwards-compatible. JSON lacks the schema enforcement needed for production analytics.
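A minimal sketch of one event on that path, assuming a reachable Kafka broker and the confluent-kafka and fastavro packages; the topic name and schema fields are illustrative, not a proposed production schema:

```python
# Typed telemetry event on the analytical-twin pipeline: Avro over Kafka.
# Minimal sketch; topic name and fields are illustrative.
import io
import time
from confluent_kafka import Producer
from fastavro import parse_schema, schemaless_writer

SESSION_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "SessionEvent",
    "namespace": "telemetry.v1",  # evolve by adding fields with defaults only
    "fields": [
        {"name": "user_id",  "type": "string"},  # pseudonymous ID, not raw identity
        {"name": "scenario", "type": "string"},
        {"name": "event",    "type": "string"},  # e.g. "session_completed"
        {"name": "ts_ms",    "type": "long"},
        # Deliberately no raw conversation text; that is governed separately.
    ],
})

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit(user_id: str, scenario: str, event: str) -> None:
    """Serialize one event with the Avro schema and produce it to Kafka."""
    buf = io.BytesIO()
    schemaless_writer(buf, SESSION_EVENT_SCHEMA, {
        "user_id": user_id, "scenario": scenario,
        "event": event, "ts_ms": int(time.time() * 1000),
    })
    producer.produce("session-events", value=buf.getvalue(), key=user_id)

emit("u-123", "salary_negotiation", "session_completed")
producer.flush()
```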

Risks

Fundraising

VCs are the obvious fundraising choice for most AI startups, and they did raise $89.4 billion in 2025, but when growth-at-all-costs is not a viable business model, how can Better Half raise the funds needed to build a consumer product? Better Half is, initially, a consumer bet with weak monetization signals and slow proof. That is a difficult pitch to most venture investors. The transition from grants to revenue needs careful choreography to avoid the product being warped by the demands of whichever funding source materialises first. This places a premium on founder credibility and a narrative that can attract “patient capital” or non-dilutive grants to bridge the gap to initial revenue streams.

AI alignment funding is currently less than 0.3% of AI startup VC funding. Available non-VC research grants for alignment are modest: Coefficient Giving aggregates over $200 million in 2025 across various funds, but the directly accessible grants are much smaller. The Alignment Project offers $20 million, Superalignment $10 million, the AI Safety Fund $5 million, and Conjointly $14,000. The counterexample of Safe Superintelligence (SSI) raising over $1B does not really count, because it reflects founder credibility (Ilya Sutskever) rather than market validation.

The name

“Better Half” is consumer-coded, intimacy-tinged, and already taken twice. Betterhalf is an Indian AI-powered matrimony app, and bttrhlf.com is a marketing agency with tech customers. Enterprise customers may find it a curious name for a professional training tool, though the sparring-partner framing softens this somewhat. SEO will be an ongoing headache, though.

Thriving versus engagement

The language on the website clashes with the position paper’s ambitions:

“Feels like a relationship. Works like a game.” (Website)

The same sentiment is observable in a LinkedIn comment by one of the co-founders:

“It’s basically a game engine that incentivizes pro-social outcomes.” (Anastasia Uglova, CEO)

SDT shows that relational competence depends on intrinsic motivation (i.e. agency, judgement, and meaning). Gamification, however, shifts motivation from intrinsic to extrinsic: relatedness and competence are undermined when behaviour is driven by external rewards. A progress bar for your relationship health is counterproductive to the goal of authentic human connection. Game mechanics push people towards metric optimization rather than personal growth. Rehearsal thus becomes a performance, and trust erodes once behaviour feels nudged rather than self-directed.

So, how do we encourage repeated practice without external rewards? For humans to flourish, outgrowth is expected.

Rehearsal versus chat

“We’re not building a friend. We’re building a sparring partner.” (Position paper, p. 9)

If the product lives in a chat interface, the default user interpretation will always be chatting, venting, or emotional support. Rehearsal requires pushback, and pushback only feels constructive when people expect it. If they come expecting a sympathetic (if virtual) ear and receive a challenge instead, the experience is not merely unpleasant; it could become a safety risk. Such misinterpretation cannot be solved with disclaimers. The product must act like a rehearsal:

  • Scenario selection
  • Explicit role declaration
  • Time-bounded session
  • Reset and retry affordances
  • Reflection framed around what was learned rather than how it felt

The UX has to make the mode unmistakable.
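These affordances can live in the product’s data model rather than in disclaimers. A minimal sketch; the phases, fields, and defaults are my own illustration:

```python
# Rehearsal-mode affordances as explicit state, not disclaimers.
# Minimal sketch; phases, fields, and defaults are illustrative.
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    SCENARIO_SELECTION = auto()  # user picks what to rehearse
    ROLE_DECLARATION = auto()    # AI states whom it will play, and how hard
    REHEARSAL = auto()           # time-bounded sparring
    REFLECTION = auto()          # "what was learned", not "how it felt"

@dataclass
class RehearsalSession:
    scenario: str
    counterpart_role: str   # e.g. "your manager, sceptical of the raise"
    pushback_level: int     # calibrated, user-visible, bounded
    time_limit_s: int = 900 # sessions end; chat does not
    phase: Phase = Phase.SCENARIO_SELECTION
    retries: int = 0        # reset-and-retry is a first-class affordance

    def retry(self) -> None:
        """Reset to the start of the rehearsal, keeping the declared role."""
        self.retries += 1
        self.phase = Phase.REHEARSAL
```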

The re-entry problem

Many real-world needs for rehearsal are temporary: preparing for a difficult conversation, managing a short-term leadership challenge, navigating a conflict, or negotiating salary. Some are longer-term (e.g. social anxiety reduction) but even these tend towards episodic intensity rather than permanent dependency.

Thriving can imply a successful exit from one scenario followed by eventual return in another. This is the outgrowth-and-re-entry pattern, in which a healthy product may reduce its own usage. Perhaps more appropriate are time-bounded programmes (“prepare for X” or “repair Y”) rather than open-ended improvement. Pricing and metrics must not punish successful churn-and-return when life demands it.

A subscription model assumes an ongoing dependence. The analogies for churn-and-return are illuminating: tax software, exam preparation, physical rehabilitation, or fertility treatment. Nobody wants these forever, but they pay when it matters. If re-entry is a real pattern, and there is good reason to believe it is, the pricing model must accommodate it cleanly rather than pretending that every cancellation is a failure.

The obvious counterargument to outgrowth is that people have many relationships: children, spouses, neighbours, parents, coworkers, bosses, exes, the parents of their kids’ friends, and so on. New frictions surface constantly, so there will always be another conversation to rehearse, which could sustain ongoing usage without the product needing to manufacture engagement. Most interactions are, however, not high-stakes and therefore do not require rehearsal. And if someone begins practising every conversation before it happens, the product has become a crutch rather than a genuine aid. In that case, the person is not developing competence, but they are developing dependency on a preparatory ritual. That is dependency-as-engagement, which is exactly what the constraint (C) is meant to prevent.

There is a deeper problem too. Running every conversation through an AI rehearsal first erodes something important: authenticity. Part of being a capable human is making honest mistakes, stumbling through an awkward apology, saying the slightly wrong thing and recovering in real time. Personal growth requires occasional friction and embarrassment. A person who pre-screens every interaction is not someone who has become more relationally competent. Rather they are someone who has learned to avoid the very friction that is needed for growth.

Pushback and harm

“Our system […] produces authentic pushback.” (Position paper, p. 9)

Rehearsal without resistance is flattery, but confrontation can worsen distress in vulnerable customers. How do we calibrate pushback? And do we have a fail-safe when pushback escalates to serious harm? These are important design requirements.

Attachment

“[The system] maintains presence across time, not just within turns.” (Position paper, p. 6)

Human-like conversational agents increase emotional attachment, as Skjuve et al. (2021) demonstrated. How do we prevent unhealthy dependence or substitution? If the system is good enough to feel like a trusted confidant, some users will prefer it to actual humans. That is not thriving at all.

The 30/60/90/120-day plan, again

Let’s revisit the reasoning behind each phase mentioned earlier. Running alongside the plan are 10–15 qualitative interviews to map mental models, trigger moments (e.g. breakups, job changes, leadership conflict, negotiation, managing anger or anxiety in specific contexts), and rehearsal misinterpretation risk. These interviews are explicitly not to be used to validate demand or outcomes. They exist to understand how people hear the product:

  • Do they hear “sparring partner” or “emotional support”?
  • Do they expect reassurance, advice, or challenge?
  • What language triggers defensiveness or relief?
  • How much pushback feels acceptable before challenge becomes threatening?
  • Which phrasings feel judgemental, gamified, or manipulative, and which ones preserve agency and dignity?

Interview prompts for trigger moments might include:

  • “Tell me about the last conversation you avoided.”
  • “What happened afterwards?”
  • “What did you try to prepare, if anything?”

These map the emotional territory the product will inhabit.

Interviews are not to be used for pricing, because revealed intent always beats stated preference, especially for aspirational products.

This reinforces the “de-risking” nature of the plan.

30 days: H2 — Do people understand rehearsal, and are they willing to pay?

Launch a paid landing page with a fixed set of scenarios, backed by prompt-based rehearsal and structured reflection. The landing page also carries the positioning, ensuring customers understand Better Half as rehearsal, not therapy and not companionship. The idea is to practise difficult conversations before they happen, conversations you don’t want to get wrong.

If there is no intent to pay, the decision is whether to double down on consumer or to pivot.

Rehearsal is most valuable i) when the conversation is high-stakes, ii) when a negative outcome is irreversible or costly, and iii) when failure is hard to recover from.

Consumer scenarios span irreversible conversations (e.g. confrontation, reconciliation, resignations, salary negotiations, boundary-setting with parents, coming out, disclosing a medical diagnosis, asking for forgiveness, ending a relationship), emotional regulation in context (e.g. anxiety before a public presentation, shame before a repair attempt), relationship repair, conflict avoidance, and career firsts (e.g. first time managing people, announcing the first layoff, first board conversation, first time negotiating equity, first time pushing back on a superior).

Enterprise and government scenarios include professional training (e.g. healthcare communication such as breaking bad news, social workers, ombuds roles, crisis lines), public-interest institutions (e.g. courts, family mediation services, immigration, asylum interviews, public defenders' offices), corporate risk management (e.g. executive conflict, terminations, whistleblower conversations, union negotiations), and education (e.g. clinical psychology programmes, law schools for client interviews and depositions, business schools for negotiation and feedback).

A fixed set of scenarios can be offered at a fixed price. If users repeatedly ask for new variations, edge cases, or role reversals, that is evidence they value practice, not novelty or reassurance. If they churn after one use, the market is tiny.

60 days: H1 — Is there any evidence that transfer might work?

Early customers must commit to IRL interactions before and after rehearsal, with follow-ups at one and three weeks. Controlled comparisons are needed between the prototype, scripted prompts with reflection, and static advice content. If the outcomes are similar across conditions, the prototype is not yet fit for purpose. This is why it is crucial to simplify the architecture as much as possible, because directional evidence of transfer is essential to continue. If there is no directional signal for transfer, the company might decide to continue anyway, on the belief that the prototype is not representative, or to focus on the gentler interpretation instead.
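Directional evidence still needs a common yardstick across the three conditions. A minimal sketch of the comparison, assuming per-participant outcome ratings collected before and after the real conversation; the numbers are illustrative placeholders, and the effect size is standard Cohen’s d:

```python
# Directional transfer signal: compare pre/post improvement across conditions.
# Minimal sketch; the deltas below are illustrative placeholders, not data.
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference between two groups of (post - pre) deltas."""
    n1, n2 = len(a), len(b)
    pooled_sd = (((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2)
                 / (n1 + n2 - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

prototype = [1.5, 0.5, 2.0, 1.0, 0.0]   # rehearsal with the prototype
scripted  = [1.0, 0.5, 1.5, 0.0, 0.5]   # scripted prompts plus reflection
static    = [0.5, 0.0, 1.0, 0.5, -0.5]  # static advice pages

print(cohens_d(prototype, static))    # prototype vs. weakest baseline
print(cohens_d(prototype, scripted))  # the comparison that actually matters
```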

90 days: H1 — Is there consistency across customers and contexts?

Let psychologists and coaches act as the system on the same fixed scenarios. Compare expert-led rehearsals against prompt-based rehearsal with reflection. Such a concierge MVP also generates high-quality, expert-annotated rehearsal transcripts that are proprietary, which is valuable fuel for the data flywheel.

120 days: H3 — Can human flourishing be operationalized?

When success occurs outside the product itself, the best proxies are outgrowth and re-entry. SDT provides the framework to define metrics around these. Sixty days of data with an early prototype and thirty days with experts, including follow-ups, are sufficient to notice outgrowth and re-entry for basic scenarios. After all, if people wish to negotiate salary or repair a relationship, that is generally more time than they are afforded in real life.

Note that these de-risking steps do not prove within-domain transfer (H1) or cross-domain transfer (H4), but they do reduce the probability that the company is building on false premises, which is the most useful thing a 120-day plan can do. Longitudinal proof may take months or even years. This plan buys time and direction, not certainty.

What success depends on

Success depends less on architectural novelty than on five things:

  1. Objective definition: What does thriving mean, operationally?
  2. Evaluation discipline: Are we measuring the right things, over the right time horizons?
  3. Market sequencing: Does cross-domain transfer enable enterprise expansion, or block it?
  4. Harm containment: What breaks when the product is misused or misunderstood?
  5. Data security: Breaches are product failures, not merely PR problems.

Most AI products optimize for engagement, because engagement is easy to measure and easy to sell to investors. A product that focuses on human flourishing is fundamentally different, which is why it requires new language, new metrics, and a kind of institutional patience that is rare in venture-backed startups.

If Better Half is to be provided through an API, there is also the possibility of the system being repurposed for use cases that were never intended: artificial influencers on social media, NPCs in games, pornography, or worse. Foreseeable abuse is a product concern.

Open questions

There are several problems that resist clean answers.

  • How tolerant are people of high latency during emotionally charged moments, given the multi-component architecture?
  • How important are tone, timing, and body language for human-interaction rehearsals, and how do we avoid the false confidence that text-only practice might breed?
  • How far can and should the system go in personality injection? Some crucial conversations require facing hostile or manipulative counterparts, and rehearsal is most valuable when the other person is not cooperative, but simulating that safely is a hard, open problem.
  • What prevents adversarial poisoning or manipulation of longitudinal data?
  • How can Better Half ensure that upstream model providers remain compatible with the product’s ethical commitments and economic model as their terms and pricing evolve?

A gentler interpretation

What if Better Half is not primarily about transfer or long-horizon behavioural change, but instead about feeling good enough? Call it experiential rehearsal: emotional regulation, preparedness, confidence, without any guarantee of real-world outcomes.

Under this reading, the hypothesis stack simplifies considerably: H1 becomes aspirational rather than existential, and H4 becomes entirely irrelevant. The constraint C becomes a design principle or vanishes altogether. Success means customers feel more capable, not demonstrably act differently. What remains are the risks: attachment, pushback calibration, privacy, security, and the danger of drift toward companionship.

This interpretation has a significant upside: enterprise customers already pay for courses and seminars with no guaranteed outcomes. Ultimately, the enterprise market does not particularly care about people becoming better humans. The AI merely has to be good enough, cheaper, and more scalable than those incumbents to be relevant.


Relevant public data sets

No large-scale, outcome-labelled data sets exist for longitudinal “healthy relating” across domains. What does exist is fragmented across mental health, business, and AI alignment.


Outlook

If my interpretation of Better Half’s ambitions is correct and they indeed seek to optimize for human flourishing with commercial sustainability as a key constraint, that is a rare and tricky product worth building. The implications are far-reaching and may point us towards a more ethical use of psychology in product development, instead of the default in which dark patterns dominate. I have recently explored that in the context of LLMs and intend to expand upon it in the future.

A virtual safe space to practise conversations that only happen once is an interesting idea. There is, after all, no second chance at a first impression, so it might be sensible to start at zero instead.

Building a dojo for human interactions means the ultimate victory is an empty dojo. It is a product philosophy that flies in the face of every engagement metric in Silicon Valley, and that is what makes it not just a fascinating product problem, but a potentially important one.