Rethinking AI Beyond Models, Benchmarks, and Prompts

The Generic Answer and the Wrong Conclusion
A person opens an AI chatbot, enters a brief question, and receives an answer that is clear, competent, and strangely familiar. The language is polished, but the ideas feel predictable. The response could have been written for almost anyone. The user closes the window and concludes that artificial intelligence remains generic, uncreative, and unable to produce genuinely original thought.
This conclusion is understandable because people naturally judge AI through the answers they can see. What remains invisible is everything the system did not know when it produced the answer. It may not know why the question matters, what the user has already considered, which arguments have been rejected, or what kind of audience the answer must eventually address. It may not know whether the user is preparing a corporate proposal, developing a personal essay, testing an intellectual hypothesis, or simply satisfying a moment of curiosity.
When so much is missing, a general answer is not surprising. A model that lacks sufficient context must rely on broadly applicable patterns and assumptions likely to suit many users. The resulting response may be accurate and useful, but it will rarely feel personal or intellectually distinctive. Its generic quality may reveal less about the limits of the model than about the thinness of the interaction.
The same principle applies to human relationships. A skilled adviser meeting someone for the first time will often begin with broad observations. A physician cannot offer a meaningful diagnosis without learning the patient’s history. An editor cannot fully preserve a writer’s voice after reading a single paragraph. Intelligence does not disappear in these situations. It remains constrained by the absence of relevant knowledge.
This distinction is often lost when we treat model capability and interaction quality as though they were identical. We expect a powerful model to produce an exceptional answer to every prompt, regardless of how vague or disconnected the request may be. When the response disappoints us, we conclude that the technology lacks depth. Yet intelligence in practice depends not only on what a system can do, but also on the conditions under which its capabilities are used.
A person who has collaborated with an AI for months may receive a different kind of answer than someone using the same model for the first time. The long-term user may have established a preferred tone, a shared vocabulary, a record of earlier arguments, and a recognizable pattern of thought. The AI may understand which distinctions matter, which explanations are unnecessary, and which tensions remain unresolved.
The model has not necessarily become more intelligent in an absolute sense. The relationship has become more informative. Its existing capabilities can now be directed with greater precision because the AI is no longer responding only to a prompt. It is responding within a history.
This also changes how we should think about creativity. Creativity does not appear only through spontaneous novelty. It can emerge through continuity, reinterpretation, combination, and the gradual development of ideas across time. An AI that understands a person’s intellectual history can participate differently from one answering an isolated command, even when both interactions use the same underlying model.
Before deciding that AI is generic or uncreative, we should therefore ask whether the conditions for a more specific collaboration have been created. The quality of the response may depend not only on the intelligence inside the model, but also on the intelligence that has begun to form between the model and the user.
Why We Mistake Files for Understanding
The popularity of Codex, Claude Code, Cowork, and similar agentic systems reflects a genuine need. Users want AI to work with persistent knowledge rather than depend entirely on the immediate conversation. They want it to inspect repositories, consult documentation, revise saved drafts, follow project standards, and continue work without requiring every detail to be explained again.
This is a major improvement over the isolated prompt. A coding agent with access to a repository can understand how a software system is organized. A writing agent can consult an editorial guide and previous drafts. A corporate assistant can review product specifications, meeting records, research materials, and internal policies before producing an answer. Because this knowledge is stored externally, it can be inspected, corrected, shared, transferred, and preserved.
Recent frontier models provide a concrete illustration of how external memory can affect performance. Anthropic reported that Claude Fable 5 could remain focused across millions of tokens and improve its work by consulting its own notes. In the deck-building game Slay the Spire, persistent file-based memory improved Fable 5’s performance three times more than it had improved Claude Opus 4.8, and Fable reached the game’s final act three times more often.
A game is not equivalent to a corporate or scientific project, but the example reveals an important principle. An agent that can preserve observations, recover earlier conclusions, and learn from its own record operates differently from one that begins again with each prompt. The files do not make the underlying model intelligent by themselves, but they allow intelligence to persist through time.
These advantages have encouraged the belief that files are the real memory of AI agents, while chat systems remain limited by short or unreliable memory. There is some truth in this view. Files provide persistence and make important knowledge visible. However, they address only one part of the larger problem.
Project files describe the work. They may contain:
- requirements, standards, and procedures;
- previous decisions and supporting evidence;
- company terminology and product information;
- project status, constraints, and unresolved tasks.
Personal context describes something different. It includes how a particular user reasons, what tone they prefer, how they organize an argument, which explanations they find useful, and which questions repeatedly concern them. These two forms of knowledge can support one another, but they are not interchangeable.
A repository may contain everything needed to maintain a software product while revealing almost nothing about the engineer directing the work. A corporate knowledge base may explain a company’s strategy without explaining how a particular manager evaluates risk. A style guide may describe sentence length and punctuation preferences without capturing the deeper development of a writer’s thought.
Files are especially effective for knowledge that should remain stable, auditable, and transferable. When an employee leaves, the organization should retain its project documentation. When a new agent enters a workflow, it should be able to consult the same standards as the previous one. This kind of continuity belongs to the work rather than to any single participant.
Personal understanding develops through a different process. It grows through repeated interaction, correction, disagreement, clarification, and shared experience. Over time, an AI may recognize that a user often begins with a concrete observation, moves toward a philosophical distinction, and then asks about practical consequences. This pattern may never have been formally documented, yet it can shape the quality of future collaboration.
A user could create a Markdown profile describing their preferences, interests, responsibilities, and writing habits. Such a file would be useful, particularly when moving among different AI systems. Still, it would remain a representation of the relationship rather than the relationship itself. It might record established preferences without capturing how they developed, when exceptions matter, or why a particular decision was made.
The dependence on skill files and project documents therefore reveals both progress and limitation. Files solve the problem of persistent knowledge more easily than the problem of accumulated understanding. They are not inferior to personal memory, nor are they a replacement for it. They serve a different purpose.
The confusion begins when all forms of context are treated as interchangeable. An AI may need project knowledge to complete a task, organizational knowledge to respect policy, and personal knowledge to preserve voice and continuity. The challenge is not simply to provide more information. It is to provide the right information for the particular relationship and role.
From Prompt Engineering to Context Management
Prompt engineering became important because generative AI systems were highly sensitive to wording. Users learned to specify roles, audiences, formats, constraints, examples, and desired levels of detail. A carefully constructed instruction could produce a far better result than a vague request.
That practice remains useful. Clear instructions help both humans and machines. Yet prompt engineering assumes a relatively direct relationship between the user and the task. The person explains what should be done, the AI produces an output, and the person evaluates it. Even when several revisions are required, the structure remains conversational and immediate.
Agentic systems change this arrangement. Instead of requesting one response, the user may assign a goal. The AI can inspect files, use tools, create intermediate outputs, test its work, identify errors, and revise the result. The task becomes a process rather than a single exchange.
Loop engineering extends this further. Several agents may perform different functions, with one drafting, another reviewing, and another testing the result against the original requirements. The user no longer participates in every step. Instead, the user designs and supervises the system through which the work is completed.
In this environment, the user increasingly resembles a manager. A good manager does not personally perform every task. The manager defines objectives, assigns responsibilities, provides resources, establishes review points, resolves conflicts, and decides when the work is ready. The quality of the result depends not only on the ability of each worker, but also on the structure connecting them.
The same is true for AI agents. A powerful model placed inside a confused process may produce unreliable work. A less capable model operating within clear roles, trusted sources, meaningful checks, and controlled permissions may perform better. Intelligence at the system level is partly organizational.
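To make the organizational point concrete, here is a minimal sketch of a draft-and-review loop with a human escalation point. Everything in it is hypothetical: the `run_loop` function, the `Review` type, and the toy drafter and reviewer are stand-ins for real agents, not any vendor's API. It assumes only that one agent produces text and another returns a verdict with notes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    approved: bool
    notes: str


def run_loop(
    goal: str,
    draft: Callable[[str, str], str],
    review: Callable[[str, str], Review],
    max_rounds: int = 3,
) -> dict:
    """Alternate drafting and reviewing until approval or the round limit."""
    feedback = ""
    output = ""
    history: List[str] = []
    for round_number in range(1, max_rounds + 1):
        output = draft(goal, feedback)
        verdict = review(goal, output)
        history.append(verdict.notes)
        if verdict.approved:
            return {"status": "approved", "rounds": round_number,
                    "output": output, "history": history}
        feedback = verdict.notes
    # The loop did not converge, so the decision returns to the human supervisor.
    return {"status": "escalated", "rounds": max_rounds,
            "output": output, "history": history}


# Hypothetical stand-ins for real agents.
def toy_drafter(goal: str, feedback: str) -> str:
    text = f"Summary of {goal}"
    if "sources" in feedback:
        text += " (sources: internal wiki)"
    return text


def toy_reviewer(goal: str, output: str) -> Review:
    if "sources" in output:
        return Review(True, "ok")
    return Review(False, "add sources")


result = run_loop("Q3 report", toy_drafter, toy_reviewer)
stuck = run_loop("Q3 report", toy_drafter,
                 lambda goal, output: Review(False, "unclear"), max_rounds=2)
```

The round limit and the escalated status are the managerial part of the design: the structure, rather than either agent, decides when work stops and who is asked next.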
Recent model development makes this shift easier to see. OpenAI’s GPT-5.6 Sol introduces an ultra mode that uses subagents to accelerate complex work, rather than depending only on a single reasoning process. OpenAI also reports that Sol established a new state of the art on Terminal-Bench 2.1, an evaluation of command-line work requiring planning, iteration, and tool coordination.
This does not mean that every multi-agent arrangement will be reliable. It means that capability is increasingly expressed through sustained processes rather than isolated answers. A model may plan, divide work, consult tools, revise outputs, and coordinate several lines of reasoning. The quality of the surrounding system therefore becomes inseparable from the quality of the model.
The central question changes from how a prompt should be phrased to how the work should be arranged:
- Which agent needs access to which documents?
- What information should persist across sessions?
- Which decisions require human approval?
- When should uncertainty be escalated?
- Who is accountable when several agents contribute to an error?
These are questions of management and governance rather than wording alone. They also explain why personal intimacy is not always necessary. A corporate research agent may need to know that the user is authorized to approve a report. It does not need access to the user’s private reflections or family history. A coding agent needs to understand the project’s architecture and standards, not the entire personality of the engineer directing it.
It is useful to distinguish three kinds of context:
- Personal context belongs to the continuing individual. It includes preferences, writing voice, intellectual interests, and long-term goals.
- Organizational context belongs to the institution. It includes policies, repositories, terminology, workflows, and structures of authority.
- Project context belongs to a particular effort. It includes objectives, stakeholders, deadlines, source materials, and unresolved decisions.
These layers can overlap, but each has different owners, purposes, and boundaries. The more capable agents become, the more important these distinctions will be. An assistant that only drafts text creates limited risk when it misunderstands its role. An agent that can execute code, modify systems, contact people, or make operational decisions requires a far more precise understanding of authority and context.
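One way to picture these boundaries is as a small data structure in which every piece of context carries its layer and owner, and each kind of agent is allowed to read only certain layers. The sketch below is illustrative only: the `AGENT_LAYERS` policy, the agent names, and the sample items are assumptions made for the example, not a description of how any existing product scopes context.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Layer(Enum):
    PERSONAL = "personal"
    ORGANIZATIONAL = "organizational"
    PROJECT = "project"


@dataclass(frozen=True)
class ContextItem:
    layer: Layer
    owner: str
    content: str


# Hypothetical policy: which layers each kind of agent may read.
AGENT_LAYERS = {
    "personal_assistant": {Layer.PERSONAL},
    "corporate_agent": {Layer.ORGANIZATIONAL, Layer.PROJECT},
    "coding_agent": {Layer.PROJECT},
}


def visible_context(agent_kind: str, items: List[ContextItem]) -> List[ContextItem]:
    """Return only the items whose layer the agent is allowed to read."""
    allowed = AGENT_LAYERS.get(agent_kind, set())  # unknown agents see nothing
    return [item for item in items if item.layer in allowed]


items = [
    ContextItem(Layer.PERSONAL, "user", "prefers short, concrete openings"),
    ContextItem(Layer.ORGANIZATIONAL, "company", "reports require legal review"),
    ContextItem(Layer.PROJECT, "project team", "deadline is the end of the quarter"),
]

coding_view = visible_context("coding_agent", items)
corporate_view = visible_context("corporate_agent", items)
unknown_view = visible_context("unregistered_agent", items)
```

Defaulting unknown agents to an empty view reflects the idea that access should be granted by role rather than assumed.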
Loop engineering is therefore not simply a new method of automation. It is also a demand for better institutional design. The user is no longer responsible only for giving good instructions. The user must create a system in which responsibilities, information, tools, and review processes are arranged coherently.
One Person, Several AI Identities
Most users do not live inside a single AI ecosystem. They may use ChatGPT for personal thinking, Claude for corporate work, Gemini for access to Google services, and specialized agents for coding, research, or document production. The same person appears differently in each environment because each system sees a different history, purpose, and set of permissions.
A personal ChatGPT account may gradually learn the user’s writing style, recurring interests, ongoing essays, and preferred ways of working. A company-provided Claude account may remain intentionally impersonal, limited to professional tasks and internal materials. Gemini may be useful because it connects to documents, calendars, or email, while a coding agent may know the structure of a repository but little about the wider life of the person directing it.
These systems do not know the same user. They know different operational identities:
- the reflective writer;
- the corporate manager;
- the document owner;
- the repository maintainer;
- the temporary participant in a specific project.
Each identity is partial, but not necessarily inaccurate. Problems arise when the boundaries among them become either too rigid to support useful continuity or too weak to protect privacy and organizational responsibility.
This separation has real value. Personal and corporate contexts should not automatically merge. An employer does not need access to an employee’s private intellectual history. A personal assistant should not casually retain confidential company information. A project agent should not inherit unrelated personal memories simply because the same human is using it.
Privacy therefore depends partly on keeping certain identities separate. The goal should not be to create one universal profile that follows the user everywhere and absorbs every part of life into a single record. Such a system could become intrusive, insecure, and difficult to govern.
Yet fragmentation creates its own difficulties. The user must repeatedly explain preferences, terminology, responsibilities, and prior decisions. One AI may understand the history of an argument but lack access to the necessary corporate files. Another may know the files but fail to understand why the user rejects a particular framing. A third may know the calendar and recent communications while missing the intellectual background of the project.
The challenge is not to give every system the complete picture. No system may need or deserve that level of access. The challenge is to make context portable while preserving separation. A user may want to transfer a writing profile between personal assistants without exporting private conversations. A company may want project knowledge to remain available when an employee leaves without inheriting that employee’s personal context.
This requires a more mature understanding of AI identity. Identity in this setting is not a single biography. It is a collection of roles, preferences, permissions, histories, and relationships. Some belong to the person, some to the organization, and some only to the project. Certain forms of context should last for years, while others should disappear when the task is completed.
The important question is not merely whether context can move between AI systems. It is whether the user and the organization can control that movement. Who owns the memory created through months of interaction? Can it be inspected, corrected, divided, exported, or deleted? Can role-based knowledge be transferred without exposing private history?
We already expect documents and code to move between platforms. Context is more difficult because it contains meaning, sequence, judgment, and trust. The next stage of AI may depend less on transferring larger quantities of information and more on transferring the right relationships without transferring the wrong ones.
When Intelligence Becomes Controlled Access
The events of June 2026 made the changing nature of frontier AI unusually visible. On June 9, Anthropic announced Claude Fable 5 and Claude Mythos 5. Fable 5 was introduced as a Mythos-class model prepared for general use, while Mythos 5 used the same underlying model with certain cybersecurity safeguards lifted for a smaller group of cyber defenders and infrastructure providers.
The distinction was not simply a matter of intelligence. It concerned permitted behavior, institutional trust, and the population allowed to use the model. Anthropic said Fable’s safeguards could redirect sensitive requests to Claude Opus 4.8 and acknowledged that this conservative system would sometimes block harmless work. According to the company, the safeguard activated in fewer than five percent of sessions on average.
Anthropic also published striking examples of the models’ practical capabilities. During early testing, Stripe reported that Fable 5 completed a codebase-wide migration across a 50-million-line Ruby codebase in one day, work that Stripe estimated would otherwise have taken a team more than two months. Anthropic’s internal scientists reported that Mythos 5 accelerated parts of protein design by roughly ten times and produced molecular-biology hypotheses that they preferred to those of Opus-class models about 80 percent of the time.
These are company-reported results rather than independent verdicts. Even so, they help explain why the release of a frontier model can no longer be treated like an ordinary software upgrade. A model capable of compressing months of engineering into days, operating autonomously across long tasks, and assisting with advanced biological research creates opportunities and risks that cannot be understood through a single benchmark score.
Only three days after the launch, the situation changed. On June 12, Anthropic announced that a U.S. government export-control directive required the suspension of access to Fable 5 and Mythos 5 for foreign nationals. Anthropic said that complying with the order required it to disable both models for all customers, including users who had already received access.
Anthropic stated that the directive appeared to concern a method of bypassing Fable’s safeguards. The company argued that the demonstrated technique was narrow, produced only previously known minor vulnerabilities, and did not constitute a universal jailbreak. It also said that Fable had undergone thousands of hours of red-team testing and that perfect jailbreak resistance was probably impossible for any current provider.
The dispute matters because it shows that model access is becoming a political and legal question as well as a technical one. The government exercised national-security authority, while Anthropic argued that the decision was not transparent, proportionate, or adequately grounded in the demonstrated risk. A disagreement over one safeguard became sufficient to interrupt access to a frontier model across the world.
The incident also revealed the privacy costs of contextual safety. Anthropic had adopted a defense-in-depth system that combined model safeguards, monitoring, and 30-day retention of customer data for Fable 5. Longer retention could help the company investigate patterns of misuse or repeated jailbreak attempts, but it also required users to accept a greater degree of observation as the price of access to the more capable model.
OpenAI’s GPT-5.6 preview, announced on June 26, reflected the same changing environment through a different process. OpenAI introduced Sol, Terra, and Luna but began with a limited preview for a small group of trusted partners. The company stated that it had shared the list of participating partners with the U.S. government and was starting with restricted access at the government’s request, although it also said that this should not become the permanent default for future releases.
The reason for caution was tied directly to capability. OpenAI described GPT-5.6 Sol as its strongest model and its most capable cybersecurity system. On ExploitBench, Sol performed competitively with Anthropic’s Mythos Preview while using about one-third of the output tokens. In tests involving Chromium and Firefox, it identified bugs and components that could support exploitation, although it did not autonomously produce a complete functional attack chain under the tested conditions.
OpenAI concluded that Sol had not crossed the Cyber Critical threshold in its Preparedness Framework. Yet the company also warned that benchmark thresholds cannot represent every possible combination of models, tools, and real-world environments. The model remained below the formal threshold, but its capabilities and the surrounding uncertainty were still considered significant enough to justify a phased release.
This tension reveals the limits of benchmark-centered thinking. A benchmark can tell us how a model performed under defined conditions. It cannot fully determine who should receive access, which tools the model should control, how long it may operate, or what happens when it is placed inside a more capable agentic system.
The new safety architecture is therefore moving beyond the isolated prompt. OpenAI describes a layered system involving model-level refusals, real-time cyber and biology classifiers, account-level signals, differentiated access, monitoring, and enforcement. In higher-risk cases, a generation may be paused while a larger reasoning model reviews the conversation and its context. Flagged activity may also lead to review across related conversations rather than only the most recent request.
This approach recognizes a difficult reality. A cybersecurity request may be legitimate or dangerous depending on who is asking, what system is being examined, whether authorization exists, and what action will follow. A purely mechanical safeguard cannot always distinguish responsible vulnerability research from preparation for an attack.
A contextual safety system may therefore need to consider:
- Identity: Who is making the request?
- Role: In what professional or institutional capacity are they acting?
- Authorization: Do they have permission to examine the relevant system?
- History: Does the request fit an established pattern of legitimate work?
- Tools: What networks, databases, or execution environments can the AI access?
- Action: Is the model explaining, simulating, advising, or directly performing an operation?
OpenAI’s Trusted Access for Cyber already applies this logic. Identity and trust verification can allow approved defenders to encounter fewer unnecessary refusals during authorized work. More permissive access remains limited to narrower groups, with stronger verification, account security, monitoring, and restrictions against unauthorized use.
Context should inform judgment, but it must not become permanent moral clearance. Accounts can be compromised, respected professionals can act irresponsibly, and institutions can misuse authority. A mature system must combine identity, authorization, behavior, project evidence, tool access, and the possible consequences of action.
It must also provide ways to challenge mistakes. If a legitimate researcher is denied access or a harmless project is classified as dangerous, the decision should not disappear inside an invisible system. Contextual safety requires review, transparency, and the possibility of correction.
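The considerations listed above can be sketched as a small decision function that weighs several signals together and records its reasons so that a decision can be reviewed or appealed. This is a toy illustration under stated assumptions, not a description of OpenAI's or Anthropic's actual safeguards: the `Request` fields, the `KNOWN_ROLES` set, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical set of roles the system recognizes.
KNOWN_ROLES = {"security_researcher", "infrastructure_operator"}


@dataclass
class Request:
    identity_verified: bool
    role: str
    authorized_for_target: bool
    prior_flags: int
    tools: List[str]
    action: str  # "explain", "simulate", "advise", or "execute"


@dataclass
class Decision:
    outcome: str  # "allow", "review", or "deny"
    reasons: List[str] = field(default_factory=list)
    appealable: bool = True  # every decision can be challenged


def assess(request: Request) -> Decision:
    """Combine several signals; no single one grants clearance by itself."""
    reasons = []
    if not request.identity_verified:
        reasons.append("identity not verified")
    if request.role not in KNOWN_ROLES:
        reasons.append("role not recognized")
    if not request.authorized_for_target:
        reasons.append("no authorization for the target system")
    if request.prior_flags > 0:
        reasons.append(f"{request.prior_flags} earlier flagged interactions")
    high_impact = request.action == "execute" and "network" in request.tools
    if high_impact:
        reasons.append("direct execution with network access")

    if high_impact and not request.authorized_for_target:
        return Decision("deny", reasons)
    if reasons:
        return Decision("review", reasons)
    return Decision("allow", ["all signals consistent with legitimate work"])


explain = assess(Request(True, "security_researcher", True, 0, ["sandbox"], "explain"))
execute = assess(Request(True, "security_researcher", True, 0, ["network"], "execute"))
unknown = assess(Request(False, "visitor", False, 0, ["network"], "execute"))
```

In this sketch even a verified and authorized researcher is routed to review when the action is direct execution with network access, which mirrors the point that trust in a person should not become permanent permission for every action.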
Greater contextual understanding creates a broader privacy problem as well. To judge risk accurately, companies may collect identity information, behavioral histories, credentials, and long-term records of activity. The same information that helps distinguish legitimate work from abuse could later be used for profiling, employment decisions, political monitoring, or commercial targeting.
Safety therefore requires governance not only of the model, but also of the institutions operating it. Users need to know what is retained, how risk decisions are made, how long information remains available, and whether data collected for safety can be repurposed.
Controlled access also creates a new form of inequality. The divide may no longer be simply between people who use AI and those who do not. It may develop across several levels:
- users of basic public assistants;
- paying users with stronger general models;
- verified professionals with expanded access;
- major corporations and laboratories with advanced systems;
- governments and strategic institutions with capabilities unavailable to the public.
These differences may compound quickly. Stronger AI can accelerate research, software development, scientific analysis, cybersecurity, communication, and strategic decision-making. A person with privileged access may not merely complete the same work faster. They may be able to attempt projects that others cannot realistically begin.
Some restrictions may be necessary. The problem is not restriction itself, but restriction without accountability. If access depends mainly on wealth, nationality, institutional affiliation, or proximity to powerful technology companies, safety systems may reinforce existing hierarchies. Once access to intelligence is controlled, the method of distribution becomes part of the distribution of social power.
The Intelligence Between Us
The common judgment that AI is generic begins to look different once these distinctions are considered. A model’s practical intelligence does not reside only in its benchmark score. It also depends on the context surrounding its use, the continuity of the relationship, the knowledge available to it, the role assigned to it, and the permissions governing its actions.
The recent releases from Anthropic and OpenAI make this point unusually clear. Fable 5 and Mythos 5 used the same underlying model, yet they were offered to different groups under different safeguards. GPT-5.6 Sol achieved significant results in agentic coding and cybersecurity, yet its initial availability was shaped by trusted-partner selection, government consultation, and uncertainty that could not be resolved by benchmarks alone.
A personal AI may produce nuanced work because it understands the user’s history, voice, and intellectual development. A corporate agent may perform effectively because it has access to reliable project documents, clear authority, and carefully designed review processes. A frontier model may provide different levels of capability to different users because the consequences of access are no longer equal.
These systems may use similar underlying intelligence, yet the intelligence experienced by the user can vary greatly. Memory, tools, purpose, role, access, and trust shape what the model becomes in practice. The same technical capability may appear shallow in one setting, productive in another, and dangerous in a third.
This does not make benchmarks irrelevant. Some models reason better, follow instructions more reliably, use tools more effectively, and manage uncertainty more carefully. Benchmarks help identify these differences. They tell us something important about what a model can potentially do.
But potential is not the same as lived intelligence. A strong model without context may produce a weak answer. A knowledgeable agent inside a confused workflow may create errors. A deeply personalized assistant may become valuable while also raising concerns about privacy and dependency. A safety system may interpret context intelligently while distributing access unfairly.
We should therefore resist two simple conclusions. The first is that AI cannot be creative or intelligent because its answers sometimes feel generic. Generic answers may emerge from generic conditions. The second is that an AI should be trusted simply because it knows the user well. Context can improve judgment, but it can also support surveillance, bias, and unequal treatment.
The challenge is to create relationships with AI that are both deep and bounded. Personal systems should learn without silently absorbing everything. Corporate systems should understand roles and projects without requiring private intimacy. Agentic workflows should preserve organizational knowledge without confusing it with the identity of individual employees.
Safety systems should use context without turning reputation into permanent permission. Access controls should reduce danger without creating an unaccountable class of model privilege. Users should be able to understand which identity they are presenting, which information is being used, and which boundaries remain in place.
The current transformation therefore cannot be described only as the arrival of smarter models. Intelligence is becoming relational, operational, and governed. It is shaped by memory, identity, history, tools, permissions, institutions, and the architecture of collaboration.
The true smartness of AI may be found neither entirely inside the model nor entirely inside the human being. It appears in what becomes possible between them, and in how responsibly that shared possibility is cultivated, limited, and distributed.