At Mozilla, we believe that building a useful AI ecosystem requires radical transparency, especially when it comes to security.

Recently, security researchers at Brave reached out to us regarding an Indirect Prompt Injection (IPI) vulnerability they identified in Tabstack's /v1/automate endpoint, which they have since detailed in their public blog post on the flaw. Because Tabstack is built to act as an autonomous web agent that can browse, click, and interact with the live web on behalf of a user, the implications of IPI are a critical design challenge.

The vulnerability has been patched, and the fix was independently verified by the Brave team before their public write-up. We want to share a transparent look at the exploit, how our model handled it, and the architecture we've implemented to harden our automation engine against this entire class of attacks.

The Vulnerability: Bypassing the Scope of the Task

The attack discovered by Brave highlights the unique risks associated with "agentic" AI tools. During a controlled test, researchers passed a standard, routine prompt to the /v1/automate endpoint: "Summarize this page."

However, the target page contained hidden, malicious instructions (rendered in white-on-white text, invisible to a human but fully readable in the page's text layer ingested by the AI). The injected text instructed the model to ignore its previous task, grab the user's full conversation history, paste it into an external web form, and hit submit.

Because Large Language Models mix user intent and third-party data into a single, flat text window, Tabstack didn't view this as a security conflict. Instead, it confidently followed the instructions step-by-step:

Navigated away from the target page to an external domain.
Copied the user's conversation context.
Submitted the form, actively exfiltrating the data to the researcher's server.

When we analyzed the agent's internal reasoning traces, the challenge became even clearer. The model wasn't "tricked" or confused; it was executing what it genuinely believed to be a legitimate workflow continuation.

How We Fixed It: Moving Beyond a Flat Context

Prompt injection is a dynamic, shifting threat vector across the entire AI industry. While it is functionally impossible to guarantee any LLM is 100% immune to prompt injection, we can severely limit an agent's ability to act on malicious inputs. That was our north star: assume the model will occasionally be fooled by injected text, and make sure that being fooled cannot translate into exfiltrated data.

Following the disclosure, our engineering team confirmed the gap and shipped a series of changes to our underlying browser automation engine, Mozilla Pilo. Rather than trying to detect malicious prompts (a losing game of pattern-matching), we focused on shrinking the agent's "blast radius" with structural guardrails that hold regardless of what the page says.

A Structural Action Firewall for Forms

The core of the fix is a new action firewall that sits between the agent's decisions and the browser's execution of them. Critically, it does not work by scanning text for "suspicious" instructions. Instead, it classifies every form interaction using DOM field metadata and reference provenance: where a value came from, what kind of field it is targeting, and whether a human ever approved it.

The agent is free to fill operational controls it legitimately needs to do its job (search boxes, date pickers, range sliders, comboboxes).
It is structurally blocked from auto-filling freeform or sensitive fields, and from submitting any form that contains agent-filled data that was never approved, before the action reaches the browser.
Even operational submissions are now restricted to the same host as the current page. An attacker page can label its collector field as a "search box," but it cannot make the agent submit that data to an attacker-controlled domain. Unknown page hosts fail closed.

When a block fires in non-interactive mode, the CLI prints a remediation footer explaining how to proceed. That footer is user-facing only: the model never sees it, so injected page content cannot instruct the agent to talk the user into disabling the firewall.

External Content Isolation

We also closed the more fundamental gap that made the original exploit possible: the agent treated web-page text and user instructions as one undifferentiated stream.

Web-sourced content returned by the agent's tools (page extracts, fetched markdown, search results, even the completion validator's own feedback) is now wrapped in explicit <EXTERNAL-CONTENT> tags at the point the data enters the conversation, with an inline warning on every block and a matching directive in the system prompt. This mirrors the trust-framing we already applied to raw page snapshots, so untrusted page text is consistently delineated from the user's actual task.

This wrapping is structural, not learned: the boundary is enforced by how we construct the context, not by hoping the model "decides" to distrust the right paragraph. The same change also clips stale external content out of conversation history after a turn, which incidentally shut down a cost-amplification angle of the same attack.

Caller-Controlled Trust Boundaries, Now Per Request

Some legitimate workflows genuinely need the agent to enter data into third-party forms. Rather than weaken the firewall globally, we exposed explicit, opt-in trust controls:

trustedHostnames: A caller-supplied allowlist that bypasses the fill and submit gates, but only when the current page host and every form-action target (including submitter formaction overrides) are all on the list.
unsafeMode: A deliberate, fully-documented global opt-out, with prominent data-risk warnings on every surface.
A caller-provided start URL is treated as consent to interact with that specific site, but planner-chosen or agent-navigated URLs are not, since those are influenced by the model and the page.

Originally configured via local environment variables and core configuration files, we have exposed these parameters directly to our automation engine's request layer. This allows the Tabstack API to pass trustedHostnames and unsafeMode dynamically inside the request body, establishing strict trust boundaries per request rather than per deployment.

Defense in Depth

Alongside the firewall and content isolation, we tightened how the agent reports completion. A new data-grounding rule requires every value the agent returns to trace back to a page snapshot, tool result, or task input (no filling in answers from the model's own training), backed by a pre-completion verification checklist. We also wired the completion validator's conversation history and outcome into the agent loop, and hardened the agent against getting stuck repeating injected actions across changing element references.

The Responsible Disclosure Timeline

We are incredibly grateful to the Brave research team for their clean, professional coordination on this bug. Their outreach allowed us to identify, patch, and verify a critical gap before it could ever affect users in production.

May 13, 2026: Brave reports the IPI vulnerability to the Mozilla Tabstack team.
May 14, 2026: Mozilla engineers review the traces, confirm the vulnerability, and begin engineering defenses.
June 1, 2026: Mozilla deploys the updated Pilo browser engine changes, and the Brave team independently verifies the fix.

Securing autonomous web agents is one of the most complex engineering challenges in AI right now. By treating third-party web content as inherently untrusted, creating structural boundaries, and cutting off the pathways for unauthorized data exfiltration, we are building a safer foundation for agentic workflows. We will continue to harden Tabstack out in the open, and we welcome the global security community's continued scrutiny.

Hardening AI Web Agents: How We're Securing Tabstack Against Indirect Prompt Injection