The Context Audit: Three AIs, 90 Days of Custody, One Reveal (Case File #036)
The Context Audit: Three AIs, 90 Days of Custody, One Reveal (Case File #036)
A controlled experiment was conducted over ninety days. One AI tool per day, three tools rotated in thirty-day intervals. Notion AI, then ChatGPT, then Claude. Same workflow profile. Same forty-hour week baseline. Twelve hundred prompts logged. The headline finding was the output-quality verdict. The buried finding was which tool the operator reached for most often, and why. The verdict was not the one the experiment design predicted. The case file documents how the asymmetric information posture, traced through prior Fragment Zero case files, predicted the actual outcome.

The audit parameters were as follows. Ninety days. Sixty dollars in subscription cost total. No other AI tools permitted in any of the three operating windows during the test. Every prompt logged. Every output rated. Every moment of cross-tool temptation noted in the evaluation log. Three measurement criteria: which tool the operator reached for most often, which produced the highest output quality, which felt fastest in real use. The hypothesis at experiment start was that a single tool would win across all three. The hypothesis did not survive contact with the data. Three different tools won three different categories. One of the winners was not the one the operator would have predicted.

Days one through thirty: Notion AI. Plus tier with the AI add-on, twenty dollars a month. Day one was the strongest performance window. Notion AI operates inside the operator's existing workspace, where meeting notes, project documentation, and historical email content are already stored. The capability to query the operator's own archive and receive a three-second answer with citations to the original document is something the other two tools cannot match. Day seven was the failure event. The operator attempted to use Notion AI as a long-form writer for a blog draft. The output rated 1.2 on the 5-point internal quality scale. Generic, repetitive, structurally indistinguishable from a SaaS landing page. By day fifteen the tool was filed under a single use case classification: searching the operator's own knowledge base. Pattern is consistent across forty-six tested writing tasks. The next twenty days were unproductive for any task outside that classification.

Days thirty-one through sixty: ChatGPT. Plus subscription, custom GPTs enabled, twenty dollars a month. The first week was the strongest performance window. Every short repetitive task that had previously consumed manual cycles, email rewrites, meeting prep, brainstorming variations, was reassignable to a custom GPT. As documented in the previous Fragment Zero case file on custom GPTs, the pre-compiled context model produces a measurable throughput lift. The bounded finding was speed: ChatGPT consistently produced short-task responses in eight to fifteen seconds, materially faster than the other two tools in real use. The failure mode was long-context reasoning. Coherence degradation was observable by paragraph three on five-thousand-word inputs. On a sales call transcript with twelve subtle objections embedded, ChatGPT surfaced only the three obvious ones. The classification logged: sprinter, not marathoner. Twenty days into the period the operator was already routing long-form work to the next tool in the rotation, against protocol.

Days sixty-one through ninety: Claude. Pro subscription, twenty dollars a month. The capability under evaluation: long-context reasoning. A fifteen-thousand-word document was pasted and queried for the three real arguments hiding under the polite language. The response was directly usable without modification. Claude's writing did not present as AI-generated under blind review. Edits respected the operator's existing voice. The output held coherence across multiple sections. Claude Projects with custom instructions and knowledge files filled approximately seventy percent of the role that ChatGPT custom GPTs serve, sufficient for the experiment's purposes. The trade-off, logged: Claude was measurably slower for short tasks, and lacked the polished custom-GPT marketplace ecosystem. Output quality: highest of the three, by a margin. Usage frequency: not the highest. The case file documents why.

The controlled head-to-head test. Same input on the same day across all three tools. The task: a customer call transcript, extract the three real objections, draft a follow-up email addressing each. Notion AI completed in eight seconds, surfaced decent objections, drafted a generic email. ChatGPT completed in twelve seconds, surfaced three surface-level objections, drafted an email containing identifiable AI tells. Claude completed in twenty seconds, surfaced an objection the other two tools missed entirely, drafted an email rated as send-ready without modification. On this task, Claude won output quality cleanly. Pattern is consistent across the broader sample. But a single task is one data point. The full picture, documented across the ninety-day log, is more uncomfortable for the experiment's initial hypothesis.

Verdict one: output quality. Claude. The margin was not close. For any task category where the response had to hold coherent thought across multiple sections, Claude produced outputs the operator did not substantively rewrite. The other two required cleanup. Claude required approval. The implication, logged in the audit: for any operator whose deliverable is the writing itself, Claude is the long-form reasoning subscription. Long-form documents, sales call analysis, strategy memos, edits to the operator's existing writing. The classification persists across the audit's wider sample. As documented in the Mirror Core case file, the operator's own voice is the training data that distinguishes acceptable AI assistance from contamination. Claude was the only tool in the test that consistently respected that boundary.

Verdict two: speed in use. ChatGPT. The margin was not close. For short repetitive tasks under five hundred words of output, ChatGPT averaged eight seconds per task across the audit. Claude averaged sixteen seconds for the same workload. The pre-compiled context pattern, as documented in the previous Fragment Zero custom-GPT case file, drops ChatGPT's effective response time to approximately four seconds because the context the other tools must receive at every prompt is already loaded into the agent's working memory. The classification persists across the audit's throughput sample. For any operator whose bottleneck is short-task volume rather than depth, ChatGPT is the throughput subscription. Critically: the speed advantage is enabled by the custom GPT having received and retained the operator's context once, then operating from that retention. The convenience is enabled by the retention posture.

Verdict three: usage frequency. Notion AI. The margin was not close, and was not the result the experiment design predicted. Across the ninety-day period twelve hundred prompts were logged. Notion AI received four hundred and fifty of them. ChatGPT received four hundred and ten. Claude received three hundred and forty. The reason, documented in the audit log: Notion AI is the only tool in the test set that already knows the operator's context without an explicit upload every time. Every Claude prompt and every ChatGPT prompt begins with the operator re-explaining who they are, what project they are on, which document they are referencing. Notion AI does not require that step. The friction is zero. As documented in the Memory Market case file, the data does not stay confined to its account. The flip side of that posture is what made the tool feel reach-for-it convenient: the operator's context was already inside the system, retained across sessions in ways that no consent dialog had explicitly itemized. Zero-friction tools are reached for more often than higher-quality tools that require setup. The buried finding behind the verdict: output quality matters less than retention-derived convenience.

The buying decision matrix derived from the audit, for the operator who must subscribe to only one. If the work product is writing itself, books, articles, strategy documents, contracts, Claude. If the work product is throughput, replies, brainstorms, quick edits, ChatGPT. If the work already happens inside Notion and the workspace contains a meaningful operator knowledge base, Notion AI, with full awareness of the retention posture documented in this case file. If the budget supports two, the pair is Claude plus Notion AI. Quality plus retention. ChatGPT becomes optional in that configuration. If the budget supports all three, as the operator in this audit did, the rotation pattern documented across this case file is the configuration that emerges from data. Each one wins at one thing. Each one logs interactions in ways that should be documented and reviewed.

The audit log is complete. Twelve hundred prompts categorized, three subscriptions evaluated, three different winners across three different criteria. The case file documents one operator's experiment. The retention posture documented for each of the three tools has not been modified by the vendors as of this writing. The same retention posture applies in the test subject's own configuration as it applies in yours. The convenience of Notion AI's three-second context-aware answer is enabled by the same system surface that this audit documents. The case file does not close. It waits. Run the same prompt against three AIs. Compare the outputs. Submit the anomalies to fragmentzero.net/echo.