Practice Health: A Better Way to Evaluate AI Tools

Most AI tools are evaluated on what they produce. Is the image good? Is the text accurate? Is the output novel? These are reasonable questions, but they miss something important — whether the person using the tool can sustain a working relationship with it over time.

I spent four years studying how creative practitioners (artists, filmmakers, designers, musicians) actually live with generative AI over weeks and months of daily use. What I found is that the tools that produce the best outputs are not necessarily the tools that support the best practice.

The problem with output metrics

When we evaluate AI tools by their outputs, we assume that better outputs mean better tools. But creative work isn’t a single generation event. It’s an ongoing practice: prompting, selecting, discarding, returning, revising, contextualising, and deciding what’s yours.

A tool that generates stunning images in seconds can still erode the practice around it. How? By flooding judgement with options. By compressing the time needed to evaluate. By making it impossible to reconstruct how a piece came to be. By updating silently so that yesterday’s reliable workflow produces different results today.

The practitioners I studied didn’t abandon AI tools because the outputs were bad. They struggled because the tools didn’t support the work around the outputs: the pacing, the provenance, the sense of authorship that keeps creative work meaningful.

What I mean by practice health

Practice health is a set of questions you can ask about a tool relationship over time:

Can you reconstruct how something came to be? If you made something last week using AI, can you trace what you prompted, what the model contributed, and what you changed? Or has it all scrolled away into an undifferentiated feed?

Is your vocabulary stabilising? When practitioners develop a working relationship with an AI tool, they develop shorthand — prompt patterns, evaluation criteria, ways of describing what they want. If that vocabulary keeps shifting without settling, the practice isn’t taking hold.

Do parked ideas resurface? Healthy creative practice involves deferral — setting things aside and coming back later with fresh eyes. Tools that don’t support pausing, parking, and returning force every decision to be immediate, which degrades judgement.

Does authorship still feel like yours? This isn’t about copyright. It’s about whether the practitioner can point to their own contribution and articulate why the work is theirs. When AI compresses the labour of making so far that the human contribution becomes invisible, practice health suffers.

Where this came from

In my PhD research, I followed one artist over ten weeks as she adopted generative AI for the first time. She was an experienced practitioner encountering these tools with no prior AI experience.

What I watched was not “learning to prompt.” It was learning to live with a new member of the studio ecology. She developed temporal rhythms: short bursts at the screen, then off to the bench to sketch or paint, returning days later to re-evaluate. She invented parking practices: setting aside images she couldn’t judge yet and coming back when her perspective had shifted. She developed comfort controls: personal rules about when the AI was contributing too much and needed to be dialled back.

Items she parked and returned to after a few days had survival rates of 40-60%. Items she judged immediately during generation sessions survived at 10-25%. The pause wasn’t procrastination. It was the practice working.

Why this might matter for product teams

If you’re building AI tools, practice health gives you different questions to ask:

Instead of “how quickly can users get an output?” ask “how quickly can users get an output they’d actually use?” In my research, time-to-first-output and time-to-usable-output were very different numbers.

Instead of “are users generating more?” ask “are users keeping more?” Volume without curation isn’t productivity. It’s a flood that erodes judgement.

Instead of “do users come back?” ask “do users come back and pick up where they left off?” Retention is meaningless if every session starts from scratch because the tool has no memory of what happened last time.

Instead of “is the output good?” ask “does the user know what they contributed?” Authorship clarity isn’t a philosophical nicety. It’s what keeps people invested in the practice rather than feeling like an operator of someone else’s system.
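To make the first two reframings concrete, here is a minimal sketch of how a product team might compute them from session telemetry. The event names and log shape are hypothetical, invented purely for illustration; they don’t come from my study or from any particular product.

```python
from datetime import datetime

# Hypothetical event log for one user session. Each event has a timestamp,
# a type, and an optional reference to the generated item it concerns.
events = [
    {"t": "2024-05-01T10:00:00", "type": "session_start"},
    {"t": "2024-05-01T10:00:40", "type": "generated", "item": "img_01"},
    {"t": "2024-05-01T10:14:30", "type": "generated", "item": "img_02"},
    {"t": "2024-05-01T10:18:00", "type": "used_in_work", "item": "img_02"},
]

def parse(ts):
    return datetime.fromisoformat(ts)

start = parse(next(e["t"] for e in events if e["type"] == "session_start"))

# Time-to-first-output: the first generation of any kind.
first_output = min(parse(e["t"]) for e in events if e["type"] == "generated")

# Time-to-usable-output: the first generated item that later lands in real work.
used_items = {e["item"] for e in events if e["type"] == "used_in_work"}
usable = [parse(e["t"]) for e in events
          if e["type"] == "generated" and e["item"] in used_items]
first_usable = min(usable) if usable else None

# Keep rate: what fraction of generated items is ever used?
generated = [e for e in events if e["type"] == "generated"]
keep_rate = len(used_items) / len(generated) if generated else 0.0

print("time to first output:", first_output - start)
print("time to usable output:", (first_usable - start) if first_usable else "never")
print("keep rate:", keep_rate)
```

In this toy log the first output arrives within a minute, but the first output the user actually keeps doesn’t appear until nearly a quarter of an hour later, and only half of what was generated survives. That gap is what the reframed questions are pointing at.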

Practical signals

If you’re a researcher evaluating AI tools, here are four lightweight metrics I developed:

Spark-to-paper latency — How long between generating something and it appearing in real work (a document, a design, a presentation)? Long latency might mean the tool’s outputs aren’t usable without significant human work. Very short latency might mean outputs are being accepted uncritically.

Review survival — Of items set aside for later review, what percentage eventually gets used? High survival suggests good judgement in what to defer. Low survival suggests the tool is producing volume without value.

Comfort moves — How often does the user adjust parameters, switch tools, or change approach mid-session? Some adjustment is healthy exploration. Constant adjustment suggests the tool isn’t meeting expectations.

Vocabulary stability — Is the user’s prompt language (or search language, or instruction language) settling into patterns, or is it thrashing? Stability suggests a working relationship is forming.

None of these require invasive logging. They can be observed through diary studies, session recordings, or lightweight telemetry. They describe what’s happening without prescribing what should happen.
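As a rough illustration of how lightweight this can be, here is a sketch that derives all four signals from the kind of minimal records a diary study or basic telemetry might yield. The record structure, field names, and the Jaccard-overlap proxy for vocabulary stability are my own choices for this example, not an established schema or the instrument I used in the research.

```python
from datetime import datetime

# Hypothetical per-item records: when it was generated, whether it was parked
# for later review, and when (if ever) it landed in real work.
items = [
    {"generated": "2024-05-01T10:01:00", "parked": True,  "used": "2024-05-04T09:30:00"},
    {"generated": "2024-05-01T10:03:00", "parked": True,  "used": None},
    {"generated": "2024-05-01T10:05:00", "parked": False, "used": "2024-05-01T10:20:00"},
]

# Hypothetical per-session records: prompt text and mid-session adjustments.
sessions = [
    {"prompts": ["moody seascape, long exposure", "seascape, storm light"],
     "adjustments": 7},
    {"prompts": ["seascape, storm light, long exposure"],
     "adjustments": 2},
]

def hours_between(a, b):
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 3600

# 1. Spark-to-paper latency: generation -> appearance in real work.
latencies = [hours_between(i["generated"], i["used"]) for i in items if i["used"]]

# 2. Review survival: of parked items, how many were eventually used?
parked = [i for i in items if i["parked"]]
review_survival = sum(1 for i in parked if i["used"]) / len(parked) if parked else None

# 3. Comfort moves: adjustments per session (the trend matters more than the number).
comfort_moves = [s["adjustments"] for s in sessions]

# 4. Vocabulary stability: token overlap between consecutive sessions' prompts.
def tokens(session):
    return {w for p in session["prompts"] for w in p.replace(",", " ").split()}

stability = [
    len(tokens(a) & tokens(b)) / len(tokens(a) | tokens(b))
    for a, b in zip(sessions, sessions[1:])
]

print("spark-to-paper latency (hours):", latencies)
print("review survival:", review_survival)
print("comfort moves per session:", comfort_moves)
print("vocabulary stability (Jaccard):", stability)
```

None of these single numbers means much on its own; they become informative as trends across sessions, and where telemetry isn’t appropriate, the same records can be kept by hand from diary entries or session recordings.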

The bigger picture

AI tools are entering every domain of knowledge work. Writing assistants, coding copilots, design generators, research tools. In each domain, the same questions apply: can people sustain a healthy working relationship with these tools over time, or do the tools erode the judgement, authorship, and rhythm that make work meaningful?

Output quality will keep improving. That’s the easy part. The hard part is designing tools that support the practice around the outputs — the pausing, the returning, the tracing, the deciding what’s mine. That’s where the real product differentiation will be.
