Upgrading Pilo to Support Human-in-the-Loop Browser Automation

What happens when an AI web agent needs context it does not have? With Pilo Interactive Mode, it no longer has to guess or fail. This new beta feature lets Pilo pause its reasoning loop, ask the user or a parent agent for the missing data, and seamlessly complete the task.


When we open-sourced Pilo, our goal was to provide a robust execution engine for AI web agents capable of navigating the chaos of the modern web. By leveraging the accessibility tree and intelligent reasoning loops, Pilo handles complex browser orchestration reliably. But autonomous execution has a hard limit: missing context.

Today, we are excited to introduce Interactive Mode (Beta) for Pilo. This new feature allows the Pilo agent to pause its execution loop and ask the caller, whether that is a human user or a parent agent, for the specific information it needs to complete a task.

The Problem of Missing Context

In traditional web automation, agents are expected to operate entirely on their own once given a prompt. But what happens when an agent reaches a step that needs information only the caller has, such as personal details or preferences?

Imagine giving an agent the prompt: "Sign up for the Mozilla newsletter."

The agent can successfully generate a plan, navigate to the Mozilla website, locate the newsletter signup page, and identify the required input fields. But when it tries to submit the form, it hits a roadblock: it doesn't know your email address or your subscription preferences. Previously, this would result in a failed task or a hallucinated input. The agent was trapped in a silo, unable to request the missing piece of the puzzle.

Enter Interactive Mode

Interactive Mode transforms Pilo from a purely autonomous executor into a collaborative agent. Instead of failing when it lacks information, Pilo can now dynamically halt its loop and prompt the caller for guidance.

For our first iteration of Interactive Mode, we are focusing specifically on form completion. When Pilo encounters a form requiring personal data, credentials, or preferences it doesn't possess, it will output a request for input. Once the caller provides the missing information, Pilo directly injects it into the form and resumes the agentic loop to complete the task.

How It Works Across the Stack

Interactive Mode is built into Pilo's core execution engine rather than relying on the LLM to decide when to ask for help. When Pilo encounters a form, it uses a "fill gate" mechanism to inspect the required fields. If it detects that user input is needed, Pilo pauses task execution and requests the data directly, without consulting the LLM.
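To make the idea of a deterministic gate concrete, here is a minimal sketch in Python. It is not Pilo's implementation: the `FormField` type and the `fill_gate` function are hypothetical names, and the rule used (pause when a required field has no known value) is an assumption about what such a check might look like.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormField:
    """Hypothetical view of one form field as seen by the engine."""
    ref: str                      # element reference in the page snapshot
    label: str                    # human-readable label
    required: bool = False
    value: Optional[str] = None   # value the agent already knows, if any


def fill_gate(fields: list[FormField]) -> list[FormField]:
    """Return the required fields that still have no value.

    An empty result means the form can be filled without pausing.
    No model call is involved: the decision is a plain data check.
    """
    return [f for f in fields if f.required and not f.value]


form = [
    FormField(ref="e12", label="Email", required=True),
    FormField(ref="e13", label="Country", required=False),
    FormField(ref="e14", label="Language", required=True, value="English"),
]

missing = fill_gate(form)
needs_input = bool(missing)  # True here: "Email" is required and unknown
```

The point of the sketch is that the pause decision depends only on the form's structure, so it behaves the same way on every run.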

Because Pilo can be run in several different environments, we designed the interactive feedback loop to adapt to how you are using it:

  • Pilo Core: When you are building directly on top of the Pilo library, the core engine expects a simple callback function. When the execution loop pauses, it triggers your callback with the required fields, waiting for your application logic to return the necessary data before resuming.
  • Pilo CLI: For developers testing locally, the Pilo CLI handles this automatically. When the core requests information, the CLI pauses the terminal output and interactively prompts the user for the missing fields right in the console.
  • Pilo Server: For remote execution, the Pilo Server emits an interactive:form_data:request event containing the full field data. It utilizes a dedicated WebSocket endpoint (/pilo/run), enabling real-time, bidirectional communication to exchange user data while the task is suspended.
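The callback pattern described for Pilo Core can be illustrated with a toy loop. This is a sketch under stated assumptions, not the Pilo API: `run_task`, the step dictionaries, and the `on_input_needed` parameter are all hypothetical. Only the field shape (`ref`, `label`, `required`) and the answer shape (`ref`, `value`) mirror the Tabstack example in this post.

```python
from typing import Callable

# Hypothetical shapes: a request is a dict with "ref", "label", "required";
# an answer is a dict with "ref" and "value".
FieldRequest = dict
FieldValue = dict
InputCallback = Callable[[list[FieldRequest]], list[FieldValue]]


def run_task(steps: list[dict], on_input_needed: InputCallback) -> list[str]:
    """Toy execution loop that pauses on form steps and resumes with callback data."""
    log = []
    for step in steps:
        if step["kind"] == "form":
            # Pause: hand the fields to the caller and block until it answers.
            answers = on_input_needed(step["fields"])
            by_ref = {a["ref"]: a["value"] for a in answers}
            for field in step["fields"]:
                log.append(f"fill {field['ref']}={by_ref[field['ref']]}")
        else:
            log.append(f"{step['kind']} {step['target']}")
    return log


def supply_email(fields: list[FieldRequest]) -> list[FieldValue]:
    # Stand-in for real application logic (UI prompt, database lookup, etc.).
    return [{"ref": f["ref"], "value": "user@example.com"} for f in fields]


steps = [
    {"kind": "navigate", "target": "newsletter page"},
    {"kind": "form", "fields": [{"ref": "e12", "label": "Email", "required": True}]},
    {"kind": "click", "target": "Subscribe"},
]

log = run_task(steps, supply_email)
```

A CLI front end would pass a callback that prompts in the terminal, while a server front end would pass one that forwards the request over its transport and waits for the reply. The loop itself does not change.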

Furthermore, Pilo captures form validation states directly within its ARIA tree snapshots. If a user submits an invalid email, the agent detects the field error in the snapshot and will automatically trigger a re-prompt for the corrected information.
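The re-prompt behavior can be pictured as a small retry loop. The sketch below is illustrative only: the flattened snapshot format, the `invalid` and `error` keys, and the `fill_until_valid` helper are assumptions made for this example and do not reflect Pilo's actual snapshot structure.

```python
def invalid_fields(snapshot: list[dict]) -> list[dict]:
    """Collect fields whose (hypothetical) snapshot node is flagged invalid."""
    return [
        {"ref": n["ref"], "label": n["label"], "error": n.get("error", "")}
        for n in snapshot
        if n.get("invalid")
    ]


def fill_until_valid(fields, ask, submit, max_attempts=3):
    """Ask for values, submit them, and re-ask only for fields that failed."""
    pending = fields
    for _ in range(max_attempts):
        values = ask(pending)
        snapshot = submit(values)          # snapshot taken after filling
        pending = invalid_fields(snapshot)
        if not pending:
            return True
    return False


# Fake caller and fake page for demonstration purposes.
answers = iter(["not-an-email", "user@example.com"])


def ask(pending):
    return [{"ref": f["ref"], "value": next(answers)} for f in pending]


def submit(values):
    return [
        {
            "ref": v["ref"],
            "label": "Email",
            "invalid": "@" not in v["value"],
            "error": "Enter a valid email address",
        }
        for v in values
    ]


ok = fill_until_valid([{"ref": "e12", "label": "Email"}], ask, submit)
```

Capping the number of attempts is a design choice in this sketch so that a caller who keeps supplying bad data cannot stall the task indefinitely.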

Seeing It in Action with Tabstack

The Tabstack /automate endpoint is built directly on top of the Pilo Server. This means you can test the WebSocket-driven Interactive Mode today using the Tabstack SDK.

Here is what it looks like to catch those interactive requests and supply the missing context:

from tabstack import Tabstack
 
# Initialize the Tabstack client
client = Tabstack(
    api_key="YOUR_TABSTACK_API_KEY",
)
 
# Start a task with Interactive Mode enabled
stream = client.agent.automate(
    task="Sign up for the Mozilla newsletter",
    interactive=True,
)
 
# Listen for events in the stream
for event in stream:
    print(event)
 
    # Catch the specific event where the agent asks for form data
    if event.event == "interactive:form_data:request":
        request_id = event.data.get("requestId")
        fields = event.data.get("fields", [])
 
        print(f"\n--- Interactive input requested (requestId: {request_id}) ---")
        
        field_values = []
        
        # Dynamically prompt the user for the missing fields
        for field in fields:
            ref = field.get("ref", "")
            label = field.get("label", "")
            required = field.get("required", False)
            
            prompt = f"  {label}{'*' if required else ''}: "
            value = input(prompt)
            field_values.append({"ref": ref, "value": value})
 
        # Submit the user's input back to the agent to resume the task
        response = client.agent.automate_input(
            request_id,
            fields=field_values,
        )
        print(f"\n--- Input response: {response} ---\n")

By decoupling the reasoning engine from the data source, Interactive Mode allows you to build more adaptable, resilient pipelines. Whether you are typing in the terminal or wiring Pilo into a complex backend where a master orchestrator agent automatically supplies the missing data, the engine easily adapts.
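For the orchestrator case, the `input()` prompts in the example above could be replaced by a lookup against data the parent agent already holds. The following is a minimal sketch: `answer_from_profile` and the `profile` dictionary are hypothetical, and matching fields by lowercase label is an assumption made for illustration. The field and answer shapes follow the Tabstack example.

```python
def answer_from_profile(fields: list[dict], profile: dict) -> tuple[list[dict], list[str]]:
    """Match requested fields to known values by lowercase label.

    Returns the answers that could be resolved, plus the labels of required
    fields that could not, so the caller can escalate those to a human.
    """
    values, unresolved = [], []
    for field in fields:
        key = field.get("label", "").strip().lower()
        if key in profile:
            values.append({"ref": field.get("ref", ""), "value": profile[key]})
        elif field.get("required", False):
            unresolved.append(field.get("label", ""))
    return values, unresolved


profile = {"email": "user@example.com", "country": "Canada"}

requested = [
    {"ref": "e12", "label": "Email", "required": True},
    {"ref": "e13", "label": "Country", "required": False},
    {"ref": "e14", "label": "Language", "required": True},
]

field_values, unresolved = answer_from_profile(requested, profile)
```

When `unresolved` is empty, `field_values` has the same shape as the list passed as `fields=` to `client.agent.automate_input` in the example above. Otherwise, the orchestrator can fall back to prompting a person for the remaining labels.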

Just the First Step

Interactive Mode is currently in Beta. This initial release focuses on form-filling scenarios, and it is only the first step in our vision for collaborative agents.

As we continue to test and build out this functionality, we plan to extend Pilo's interactive capabilities to more complex scenarios, such as resolving ambiguous instructions mid-task, requesting visual confirmations before executing destructive actions, and handling multi-step interactive workflows.

We built Pilo to be the most reliable open-source foundation for web agents, and giving those agents the ability to ask questions is a massive step forward in making them truly useful.

Check out the updated documentation on the Mozilla Pilo GitHub repository or try out the SDK at Tabstack.ai today.
