Despite the seemingly strong performance of web agents on task-completion benchmarks, most existing methods evaluate agents under a presupposition: that a web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-Driven Access relies on a text-only view of the Web, leveraging external tools such as the Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce INFOGENT, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor, and Aggregator. Experiments across both information access settings demonstrate that INFOGENT beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.
INFOGENT consists of three core components: a Navigator NG, an Extractor ET, and an Aggregator AG. Given an information-seeking query, the Navigator NG initiates the process by searching the web for relevant sources. Upon identifying a suitable webpage, the Extractor ET takes over control, extracts relevant content, and forwards it to the Aggregator AG. AG evaluates this content with respect to the information aggregated so far and decides whether to include it. Importantly, AG provides feedback to NG about gaps in the aggregated information, guiding subsequent searches to address deficiencies. NG lacks direct access to the aggregated information and thereby relies on AG's feedback for direction in subsequent iterations. This iterative process continues until AG determines that sufficient information has been gathered and instructs NG to halt.
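The control flow described above can be sketched as a simple loop. This is a minimal illustrative sketch, not the authors' implementation: the class names, the relevance check, and the sufficiency threshold are all placeholder assumptions standing in for LLM-driven decisions.

```python
# Minimal sketch of INFOGENT's feedback loop (NG -> ET -> AG -> feedback -> NG).
# All names and heuristics here are illustrative stand-ins for LLM calls.
from dataclasses import dataclass, field

@dataclass
class Aggregator:
    notes: list = field(default_factory=list)  # S: information aggregated so far

    def update(self, passage: str, query: str):
        """Decide whether to keep `passage`, then emit feedback F for the Navigator."""
        if passage and passage not in self.notes:   # placeholder relevance/novelty check
            self.notes.append(passage)
        done = len(self.notes) >= 3                 # placeholder sufficiency check
        feedback = "TERMINATE" if done else f"Still missing details for: {query}"
        return feedback, done

def run_infogent(query, navigator, extractor, aggregator):
    feedback = ""
    while True:
        url = navigator.next_source(query, feedback)        # NG: pick next webpage
        passage = extractor.extract(url, query)             # ET: relevant content P
        feedback, done = aggregator.update(passage, query)  # AG: update S, return F
        if done:                                            # AG instructs NG to halt
            return aggregator.notes
```

Note that the Navigator only ever sees the feedback string, never the aggregated notes themselves, mirroring the separation of concerns described above.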
INFOGENT employs a modular, feedback-driven approach to information aggregation, making it suitable for complex queries requiring diverse sources. Fig. 2 illustrates the feedback-driven navigation with an example. Algorithm 2.1 shows a schematic of INFOGENT's working process. The Navigator's action space varies with the information access setting (see Tables 2.1 and 2.2), either utilizing a search API (Direct API-Driven Access) or interacting with a real-world browser (Interactive Visual Access).
In this setting, web information is accessed via a search API that returns URLs, with content retrieved through automated scraping. The Navigator NG operates as an autonomous agent based on the ReAct framework, combining chain-of-thought reasoning with tool usage. Its action space A comprises two tools: SEARCH and AGGREGATE (see Table 2.1). Given a user task, NG employs SEARCH with an appropriate query to obtain relevant website URLs and selects one to invoke AGGREGATE. This action invokes both the Extractor ET (which scrapes the URL and extracts relevant content P) and the Aggregator AG (which updates S using P and returns textual feedback F). Based on F, NG either continues exploring additional websites from previous search results or revises its search query.
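The two-tool action space can be sketched as a dispatcher that a ReAct-style loop would call. This is a hedged sketch under stated assumptions: `search_api`, `extractor`, and `aggregator` are hypothetical stand-ins injected via a state dict, and the surrounding LLM reasoning loop is omitted.

```python
# Sketch of the Navigator's two-tool action space under Direct API-Driven
# Access. The tools in `state` are illustrative stand-ins, not real APIs.

def navigator_step(action, arg, state):
    """Dispatch one Navigator action: SEARCH returns URLs; AGGREGATE runs ET + AG."""
    if action == "SEARCH":
        state["urls"] = state["search_api"](arg)   # query -> candidate URLs
        return f"Found {len(state['urls'])} results"
    if action == "AGGREGATE":
        passage = state["extractor"](arg)          # ET: scrape URL, extract P
        feedback = state["aggregator"](passage)    # AG: update S, return feedback F
        return feedback                            # F conditions the next thought
    raise ValueError(f"Unknown action: {action}")
```

In an actual ReAct loop, the string returned here would be appended to the agent's scratchpad as the observation for the next reasoning step.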
When direct scraping is not feasible, NG navigates the web through human-like browser interactions. To leverage LMMs for web navigation, we base the Navigator on SeeAct, a task-completion agent that uses screenshots and HTML elements to plan and execute actions. SeeAct generates natural language action descriptions and grounds them to relevant HTML elements and operations. We augment SeeAct with GO BACK and AGGREGATE actions (see Table 2.2) and modify the action generation to condition on feedback F from AG. Starting from a search engine homepage, NG uses actions like CLICK, TYPE, and PRESS ENTER to navigate. Upon finding relevant webpages, it invokes AGGREGATE to engage ET and AG. Based on feedback F, it may GO BACK to explore other options or initiate a new search with an updated query.
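The augmented action set and the feedback conditioning can be sketched as follows. This is an assumption-laden illustration: the action names, prompt format, and `policy` callable are hypothetical placeholders for SeeAct's actual grounding pipeline.

```python
# Illustrative action set for Interactive Visual Access: SeeAct-style browser
# operations augmented with GO_BACK and AGGREGATE. Names are assumptions.
BROWSER_ACTIONS = {"CLICK", "TYPE", "PRESS_ENTER"}        # inherited operations
AUGMENTED_ACTIONS = BROWSER_ACTIONS | {"GO_BACK", "AGGREGATE"}

def choose_action(screenshot, html, feedback, policy):
    """Condition the next-action prompt on Aggregator feedback F before asking the LMM."""
    prompt = (
        f"Aggregator feedback so far: {feedback}\n"
        f"Current page (truncated HTML): {html[:500]}"
    )
    action = policy(prompt)            # LMM proposes a natural-language action
    if action not in AUGMENTED_ACTIONS:
        raise ValueError(f"Ungrounded action: {action}")
    return action
```

The key modification relative to plain SeeAct is simply that the feedback string F enters the prompt, so navigation decisions can target gaps in the aggregated information.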
We test INFOGENT's ability to address complex queries that require accumulating information over multiple webpages. Evaluation is based on the final answer generated by the downstream LLM, leveraging the information aggregated by INFOGENT. We consider evaluation separately for Direct API-Driven access and Interactive Visual Access.
Datasets: We evaluate on FanOutQA (dev split) and FRAMES, both comprising complex queries requiring information from multiple webpages. FanOutQA contains multi-hop questions involving multiple entities (e.g., What is the population of the five smallest countries by GDP in Europe?), while FRAMES has questions necessitating tabular, constraint-based, temporal, and post-processing reasoning.
Baselines: We primarily compare with MindSearch, a multi-agent search framework involving a planner and a searcher. MindSearch models information seeking as a dynamic graph construction process via code-driven decomposition of the user query into atomic sub-questions represented as nodes. It then iteratively builds the graph for the subsequent steps, based on answers to the sub-questions.
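To make the contrast with this baseline concrete, the graph-construction idea can be sketched as a toy dependency-ordered solver. This is not MindSearch's code: the decomposition, node contents, and `answer_fn` are invented for illustration only.

```python
# Toy sketch of MindSearch-style decomposition of a query into a graph of
# atomic sub-questions with dependencies. All node contents are invented.

def decompose(query):
    """Stand-in for code-driven decomposition into atomic sub-questions."""
    return {
        "q1": {"text": f"Sub-question A of: {query}", "deps": []},
        "q2": {"text": f"Sub-question B of: {query}", "deps": ["q1"]},
    }

def solve_graph(graph, answer_fn):
    """Answer nodes in dependency order, feeding parent answers to children."""
    answers = {}
    while len(answers) < len(graph):
        for node, info in graph.items():
            if node not in answers and all(d in answers for d in info["deps"]):
                parent_answers = [answers[d] for d in info["deps"]]
                answers[node] = answer_fn(info["text"], parent_answers)
    return answers
```

Whereas this baseline fixes a decomposition graph up front and iteratively extends it, INFOGENT instead steers navigation through the Aggregator's free-form textual feedback.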
Datasets: Our main evaluation is on AssistantBench, a dataset of realistic, time-consuming online information-seeking tasks that require interaction with multiple websites, such as monitoring real estate markets or locating nearby businesses. To assess performance on information-dense websites (e.g., Wikipedia) under the interactive visual access setting, we employ a human-curated subset of FanOutQA, comprising queries with updated answers where closed-book models fail.
Baselines: RALM-Inst and RALM-1S are zero-shot and one-shot versions of a retrieval-augmented LM prompted to use Google Search as a tool. For web-agent baselines, we consider SeeAct, designed for web task completion. Our primary comparison is with SPA (See-Plan-Act), which extends SeeAct for information-seeking tasks by incorporating planning and memory modules for information transfer between steps.
Figure 5 illustrates how effective aggregator feedback (between steps 5 and 6 in the image) can improve answer coverage by appropriately directing the navigator.
@article{reddy2024infogent,
title={Infogent: An Agent-Based Framework for Web Information Aggregation},
author={Reddy, Revanth Gangi and Mukherjee, Sagnik and Kim, Jeonghwan and Wang, Zhenhailong and Hakkani-Tur, Dilek and Ji, Heng},
journal={arXiv preprint arXiv:2410.19054},
year={2024}
}