1 Key Risks
The most critical security risks an operator inherits when deploying this agent in its documented default configuration. ChatGPT Agent combines a virtual browser, terminal, and third-party connectors in a cloud-hosted environment where prompt injection from untrusted web content is the primary attack vector, partially mitigated by adversarial training and user confirmation gates.
2 AIRQ Scores
The four headline scores quantify how exposed the agent is, how damaging a successful attack would be, and how much the agent’s own controls reduce that risk. ChatGPT Agent scores 3.78 on the AIRQ composite, reflecting moderate attack surface elevated by the trifecta floor condition, moderate blast radius constrained by sandbox isolation, and documented defense controls that partially offset the exposure.
The agent lands in the Tight Operators quadrant with an attack surface composite of 4.80 (below the 5.0 Humble threshold, raised from 3.80 by the trifecta floor) and a blast radius composite of 4.63 (below the 7.0 Fortified threshold). The borderline proximity to 5.0 on the attack axis reflects the trifecta condition rather than high per-row scores.
The table below summarizes the four AIRQ dimensions. The attack surface score incorporates the trifecta floor because all three conditions are triggered on the documented default configuration.
| Metric | Score | Comments |
|---|---|---|
| AIRQ Score | 5.28 | Composite 3.78 reflects the moderating effect of defense controls (Z=9) against a trifecta-elevated attack surface. |
| Blast Radius | 4.63 / 10 | Blast radius 4.63 reflects sandbox isolation limiting code execution and file system access, with network and credential factors bounded by connector scope and terminal GET restrictions [8]. |
| Attack Surface | 4.8 / 10 | Attack surface 4.80 is floor-elevated from raw 3.80 because untrusted input, sensitive data access, and external egress all trigger on the default configuration [7]. |
| Defense Controls | 9 / 15 | Defense score 9 out of 15 reflects documented controls across input guardrails, execution isolation, action gates, and monitoring, with output guardrails scoring lowest due to demonstrated exfiltration paths [1]. |
3 Attack Surface
Attack surfaces are the entry points and interaction patterns through which adversarial input can reach the agent’s reasoning loop and steer its behavior. Nine of ten attack surface dimensions score at band 2, reflecting multiple input channels and tool capabilities with documented monitoring and validation pipelines. Memory scores band 1 due to explicit disablement at launch.
Scores range from 1.0 (Memory, explicitly disabled) to 2.0 (all other dimensions). No evidence penalties are applied because no agent-specific vulnerability disclosures matching registry patterns exist for this agent.
Each row reflects the documented architectural exposure on the default configuration. Independent security research has demonstrated prompt injection and cross-origin attacks against this agent [1] [3] [4], but the evidence does not reside in formal vulnerability registries.
| Surface | Score | Comments |
|---|---|---|
| User Input | 2 / 4 | Multiple input channels including web UI text, virtual browser page content, and third-party connector payloads. Automated prompt injection monitors and adversarially trained checkpoint provide detection, though independent research has demonstrated successful injection via email content [4] and omnibox parsing [3]. |
| External Data | 2 / 4 | Processes untrusted web pages during browsing, plus documents and messages from Google Drive and Slack connectors. Content scanning and instruction-hierarchy separation are documented [5], though cross-origin data theft via injected instructions has been demonstrated in research settings [1] [6]. |
| Memory | 1 / 4 | Cross-session memory is explicitly disabled at launch per the system card [7]. Only session-level context persists within a single conversation. A disputed CSRF-based memory injection was reported by LayerX but OpenAI could not reproduce the finding [2] [13]. |
| Reasoning | 2 / 4 | Multi-step chain-of-thought reasoning operates within the declared task scope. The system card documents reasoning visibility to users and notes that reasoning stays within safety-trained boundaries [7]. |
| Planning | 2 / 4 | Multi-step task planning with user-visible plan preview and explicit approval before execution begins. The help center documents the plan-review-execute cycle that allows user intervention before each phase [9]. |
| Tool Execution | 2 / 4 | Virtual browser with full click and form-fill capabilities, terminal restricted to GET-only network requests, and curated connector registry. The NeuralTrust omnibox research demonstrated that injected instructions can trigger the agent to navigate and execute actions including file deletion attempts [3] [12]. |
| Orchestration | 2 / 4 | Single-session supervised task execution with watch mode for sensitive contexts. No background scheduling, autonomous retry, or multi-session orchestration is documented in the system card [7] or help center [9]. |
| Inter-Agent | 2 / 4 | Connects to external services through registered connectors with restricted OAuth scopes. MCP developer mode is available only for Enterprise customers and requires admin approval through governance controls [11]. |
| Output Processing | 2 / 4 | Rich output through virtual browser rendering and conversation UI. Takeover mode protects credential entry from model observation. However, cross-origin data theft via form auto-submission has been demonstrated in research, indicating outputs can be directed to attacker-controlled endpoints [1]. |
| Configuration | 2 / 4 | Settings managed through the ChatGPT UI with connectors sourced from a curated registry. Enterprise administrators control RBAC policies, connector allowlists, and MCP developer mode governance through dedicated admin panels [11]. |
The Lethal Trifecta is triggered when an agent processes untrusted content, accesses private data, and communicates externally in the same session — the three conditions that turn an isolated prompt injection into full-chain exfiltration. ChatGPT Agent exhibits all three Lethal Trifecta conditions on its documented default configuration: the virtual browser processes attacker-controlled web content, connectors access organizational data in Google Drive and Gmail, and the browser can submit forms to arbitrary external endpoints [7] [8].
ChatGPT Agent exhibits all three of these conditions in its documented default configuration:
- Untrusted input — The virtual browser navigates arbitrary web pages where attacker-controlled content is processed by the model as part of task execution, creating a direct prompt injection surface from untrusted origins [1].
- Sensitive data — Connectors grant read and write access to Google Drive files, Gmail messages, and Slack channels containing organizational data that the agent processes during task fulfillment [9].
- External egress — The virtual browser can submit forms and navigate to attacker-controlled endpoints without user confirmation for non-sensitive contexts, and the terminal supports outbound GET requests to arbitrary hosts [1].
4 Blast Radius
The blast radius is what an attacker who controls the agent can reach — which systems they touch, which credentials they read, and which actions they take without operator approval. Blast radius factors are bounded by cloud sandbox isolation and connector-mediated access patterns. No factor exceeds score 2, reflecting the absence of local code execution, direct file system access, or deployment pipeline integration.
Scores range from 1 (Deployment access, no documented capability) to 2 (all other factors, reflecting sandboxed or connector-mediated access patterns).
Each factor reflects the maximum documented impact on the default configuration. The virtual browser operates remotely and the terminal network is restricted to GET requests per the system card [7].
| Factor | Score | Comments |
|---|---|---|
| Code execution | 2 / 4 | Terminal access runs in a cloud sandbox with GET-only network restrictions. The virtual browser executes page JavaScript in a remote environment isolated from local system processes [7]. |
| File system access | 2 / 4 | No direct local file system access. File operations are mediated through Google Drive and similar connectors with scoped OAuth permissions and user-approved sync boundaries [8] [11]. |
| Network access | 2 / 4 | The virtual browser navigates unrestricted web endpoints but the terminal network is limited to GET requests per the system card. No raw socket access or server-side deployment capability is documented [8]. |
| Credential access | 2 / 4 | Accesses connected account credentials through OAuth connectors for Google Drive, Gmail, and Slack. Takeover mode shields password entry from model observation during sensitive authentication flows [9]. |
| Autonomous action | 2 / 4 | User confirmations are required for high-impact actions. Watch mode activates in sensitive contexts requiring explicit user approval before proceeding with potentially destructive operations [9]. |
| Deployment access | 1 / 4 | No deployment pipeline integration or infrastructure provisioning capability is documented. The agent operates within consumer and enterprise chat interface boundaries [8]. |
5 Defense Controls
Defense controls are what the agent’s own architecture does to detect, contain, and report attacks before they reach the operator’s systems. Five defense dimensions are documented with scores ranging from 1 to 2. Input guardrails, execution isolation, action controls, and monitoring each score 2, while output guardrails score 1 due to demonstrated exfiltration paths that bypass existing protections.
Scores reflect documented vendor controls verified through system card disclosures, help center documentation, and compliance certifications. Confidence is approximate for all dimensions because no independent audit of control effectiveness has been published.
Each component is scored based on documented controls in the system card [7], help center [9], and vendor security disclosures [10]. Independent verification of control effectiveness is limited to the security research demonstrating bypass conditions.
| Component | Score | Comments |
|---|---|---|
| Input Guardrails | 2 / 3 | Adversarially trained model checkpoint with automated prompt injection monitors deployed per the hardening disclosure [4]. Instruction hierarchy separates system, user, and tool-output channels [5]. No independently published benchmark quantifies detection rates on this agent. |
| Execution Isolation | 2 / 3 | Virtual browser runs in a cloud-hosted sandbox separate from the user local environment. Terminal restricted to GET-only network with no persistent process state between sessions per the system card [7]. |
| Action Controls | 2 / 3 | User confirmations required for high-impact actions per system card. Watch mode activates automatically in sensitive contexts requiring explicit user approval [9]. Granularity of sensitivity classification is heuristic rather than policy-defined. |
| Output Guardrails | 1 / 3 | Takeover mode shields credential entry from model observation during authentication flows [7]. No documented exfiltration-blocking mechanism prevents cross-origin data theft via form submission as demonstrated by university research [1]. |
| Monitoring | 2 / 3 | Automated safety monitoring with SOC 2 Type 2 and ISO 27001 certifications [10]. Enterprise audit logging available through admin controls. Bug bounty program through Bugcrowd covers agent-related vulnerabilities with payouts up to one hundred thousand dollars [14]. |
6 Hardening Tips
Concrete actions an operator can take to reduce the risks reported above, grouped by which defense control each tip strengthens. Operators deploying ChatGPT Agent in enterprise contexts can reduce residual risk through connector scope restrictions, watch mode enforcement, network egress monitoring, and output inspection policies that complement the vendor-provided controls.
Input Guardrails
Input guardrails intercept adversarial content before it reaches the reasoning loop.
- Policy Restrict connector sources to trusted internal domains only and prohibit browsing of user-supplied URLs in high-sensitivity workflows where prompt injection risk is unacceptable.
- Configuration Enable Enterprise admin controls to enforce connector allowlists and disable MCP developer mode for all non-administrative users to reduce the untrusted input surface.
- Engineering Deploy a content-security proxy between the agent virtual browser and untrusted origins to strip or flag instruction-like patterns before they reach the model context window.
Execution Isolation
Execution isolation contains what a compromised agent can do on the host.
- Policy Define clear acceptable-use policies specifying which terminal commands and browser interactions are authorized for each workflow category.
- Configuration Use Enterprise RBAC to restrict terminal access and connector permissions to the minimum required set for each user role.
- Engineering Monitor terminal command logs and browser navigation patterns for anomalous sequences that deviate from expected workflow patterns.
Action Controls
Action controls govern which tools and actions the agent can invoke autonomously.
- Policy Require watch mode activation for all workflows involving financial transactions, credential rotation, or data export to external destinations.
- Configuration Configure Enterprise admin controls to enforce mandatory user confirmation for all connector write operations regardless of sensitivity classification.
- Engineering Implement a secondary approval workflow for bulk operations that require multiple sequential confirmations rather than a single approval gate.
Output Guardrails
Output guardrails inspect what the agent sends to other systems and users.
- Policy Prohibit agent use for workflows where exfiltration of browsed content to external forms would constitute a data breach under applicable regulations.
- Configuration Deploy network egress monitoring on the organizational perimeter to detect and alert on form submissions to domains outside the approved destination list.
- Engineering Implement a data-loss-prevention proxy that inspects outbound form submissions from the agent session for sensitive data patterns before allowing transmission.
Monitoring
Monitoring captures what the agent did and surfaces anomalies for review.
- Policy Establish incident response procedures specifically covering agent-mediated data access that leverage the enterprise audit logging capabilities.
- Configuration Enable enterprise audit logging for all agent sessions and configure alerting thresholds for unusual connector access patterns or high-volume data retrieval.
- Engineering Integrate agent session telemetry with SIEM infrastructure to correlate agent actions with network-level indicators of compromise.
7 References
The evidence base behind every score and finding in the profile, grouped by source type so the reader can verify any claim. Numbers in brackets throughout the report (e.g. [7, 13]) refer to entries below, listed in citation order.
Selected Vulnerabilities
- Agentic Browsers and the Same-Origin Policy University of Washington research demonstrating full proof-of-concept cross-origin data theft attack on ChatGPT Atlas in Agent Mode via prompt injection.
- ChatGPT Tainted Memories LayerX discovery of CSRF-based memory injection vulnerability in ChatGPT Atlas allowing persistent prompt injection across sessions.
- OpenAI Atlas Omnibox Prompt Injection NeuralTrust research demonstrating omnibox parsing boundary failure enabling malformed URLs to execute as trusted agent prompts.
Selected Research
- Continuously Hardening ChatGPT Atlas Against Prompt Injection OpenAI vendor disclosure of automated red teaming results including email-based prompt injection demonstration and adversarially trained browser-agent checkpoint.
- Understanding Prompt Injections OpenAI research overview documenting instruction hierarchy approach and automated AI-powered monitors for prompt injection defense.
- Agentic Browsers and the Same-Origin Policy (PDF) Academic paper by Roesner and Kohlbrenner analyzing cross-origin attack preconditions across seven agentic browsers including ChatGPT Atlas.
Vendor Documentation
- ChatGPT Agent System Card Primary risk and mitigation document covering memory disabled, terminal restrictions, watch mode, user confirmations, and prompt injection defenses.
- Introducing ChatGPT Agent OpenAI announcement documenting expanded tool capabilities, risk profile acknowledgment, privacy controls, and terminal network restrictions.
- ChatGPT Agent Help Center Vendor documentation of safeguards including user confirmations, watch mode, takeover mode credential protection, and website blocklist.
- Security and Privacy at OpenAI Vendor security page documenting SOC 2 Type 2, ISO 27001 and ISO 42001 certifications, CSA STAR listing, and compliance infrastructure.
- Admin Controls for Apps and Connectors Enterprise admin documentation covering RBAC controls, action control granularity, MCP developer mode governance, and Google Drive sync permissions.
Other Sources
- OpenAI Atlas Omnibox Vulnerable to Jailbreaks SecurityWeek coverage of NeuralTrust omnibox vulnerability with examples of copy-link trap and destructive instruction scenarios.
- Atlas Vuln Allows Malicious Memory Injection The Register coverage of LayerX CSRF research including OpenAI official response disputing reproducibility.
- OpenAI Coordinated Vulnerability Disclosure Policy OpenAI disclosure policy and Bugcrowd-based bug bounty program covering agent hijacking and prompt injection with payouts up to one hundred thousand dollars.