index | text | url |
---|
0 | Published as a conference paper at ICLR 2024
A REAL-WORLD WEBAGENT WITH PLANNING,
LONG CONTEXT UNDERSTANDING, AND
PROGRAM SYNTHESIS
Izzeddin Gur1∗ Hiroki Furuta1,2∗† Austin Huang1 Mustafa Safdari1 Yutaka Matsuo2
Douglas Eck1 Aleksandra Faust1
1Google DeepMind, 2The University of Tokyo
izzeddin@google.com, furuta@weblab.t.u-tokyo.ac.jp
ABSTRACT
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We introduce
WebAgent, an LLM-driven agent that learns from self-experience to complete
tasks on real websites following natural language instructions. WebAgent plans
ahead by decomposing instructions into sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python programs
generated from those. We design WebAgent with Flan-U-PaLM, for grounded code
generation, and HTML-T5, a new pre-trained LLM for long HTML documents
using local and global attention mechanisms and a mixture of long-span denoising
objectives, for planning and summarization. We empirically demonstrate that our
modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks, achieving an 18.7%
higher success rate than the prior method on the MiniWoB web automation benchmark,
and SoTA performance on Mind2Web, an offline task planning evaluation.
1 INTRODUCTION
Large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023) can solve
a variety of natural language tasks, such as arithmetic, commonsense, logical reasoning, question
answering, text generation (Brown et al., 2020; Kojima et al., 2022; Wei et al., 2022), and even
interactive decision making tasks (Ahn et al., 2022; Yao et al., 2022b). Recently, LLMs have also
demonstrated success in autonomous web navigation by controlling computers or browsers to follow
natural language instructions through multi-step reasoning and decision making (Furuta et al., 2023;
Gur et al., 2022; Kim et al., 2023).
However, web automation on real-world websites has still suffered from (1) the lack of pre-defined
action space, (2) much longer HTML documents than simulated observations, and (3) the absence
of domain-specific knowledge for understanding HTML documents (Figure 1). Considering the
open-endedness of real-world websites and the complexity of instructions, defining appropriate
action spaces in advance is challenging. In addition, although several works have argued that recent
LLMs with instruction-finetuning or reinforcement learning from human feedback improve HTML
understanding and web automation accuracy (Furuta et al., 2023; Kim et al., 2023), their architectures
are not always suitable to process real-world HTML documents; as presented in Figure 2, HTML
tokens of real websites are much longer than those of simulators, and most LLMs have shorter context
lengths than the average HTML tokens in real websites. It is prohibitively costly to treat such long
documents as inputs directly, or adopt prior techniques such as text-XPath alignment (Li et al., 2021b)
or text-HTML token separation (Wang et al., 2022a). To prioritize broader task generalization and
model-size scaling, such domain knowledge for HTML documents is ignored in recent LLMs.
*Equal Contribution.
†Work done as Student Researcher at Google.
arXiv:2307.12856v4 [cs.LG] 25 Feb 2024, https://arxiv.org/pdf/2307.12856.pdf
[Figure 1 diagram: a language model agent receives a human instruction and interacts with either a simulated website (pre-defined actions, simplified HTML) or a real website (open-ended actions, long and messy HTML).]
Figure 1: Challenges in real-world web automation. Recent language model agents (Furuta et al., 2023; Gur et al.,
2022; Kim et al., 2023; Yao et al., 2022b) can navigate simulated websites (Shi et al., 2017; Yao et al., 2022a),
where the agents manipulate pre-defined actions and receive simplified HTML documents that are easy to parse.
In contrast, language model agents continue to face challenges in navigating real-world websites, where they
must interact with dynamic environments, handle open-ended actions (actions that cannot be pre-determined),
and process lengthy HTML documents containing significant amounts of task-irrelevant information.
[Figure 2 plot: number of HTML tokens (×10^3, axis from 0 to 14) for the simulator (MiniWoB) and for real websites (Q&A Community, Social Media, News, E-Commerce, Real Estate).]
Figure 2: Statistics of HTML tokens among real websites. Compared to simulator (about 0.5K tokens on average),
HTML tokens of real websites are much longer (from 7K to 14K), which takes up the context length of large
language models. As pre-processing, we remove the irrelevant tags (e.g. <script>, <meta>) and keep necessary
attributes (e.g. id, type, value).
In this work, we introduce WebAgent, an LLM-driven autonomous agent that learns from self-experience to
complete user instructions on real websites by combining canonical web actions in a program space (Figure 3).
WebAgent (i) plans sub-instructions for each step by decomposing natural language instructions, (ii) summarizes
long HTML documents into task-relevant snippets based on the plan, and (iii) acts via programming on real
websites by grounding sub-instructions and HTML snippets into executable Python code. We combine two
LLMs to form WebAgent: the newly introduced HTML-T5, a domain-expert pre-trained language model, for task
planning and conditional HTML summarization, and Flan-U-PaLM (Chowdhery et al., 2022; Chung et al., 2022) for
grounded code generation. HTML-T5 has an encoder-decoder architecture and is specialized to capture the
structure of long HTML documents better by adopting local and global attention mechanisms (Guo et al., 2022). It is
pre-trained using a mixture of long-span denoising objectives (Tay et al., 2022) on a large-scale HTML
corpus extracted from CommonCrawl. To ground language model agents into real websites, we
introduce self-experience supervision, where the domain-expert language models are finetuned with
data generated by scripted planning/summarization and self-generated programming.
Existing LLM-driven agents often solve decision making tasks with a single LLM conditioned
on different prompts per role (Kim et al., 2023; Sun et al., 2023; Zheng et al., 2023), which is,
however, not enough for real-world tasks whose complexity is higher than that of simulators. The
empirical evaluations reveal that our method incorporating self-bootstrapped specialist language
models improves HTML understanding and grounding, and achieves better generalization than a single
LLM agent. In real-world web automation, WebAgent significantly increases the success rate by over 50%,
and error analysis emphasizes that coupling task planning with HTML summarization in specialized
language models is essential for task success. Moreover, HTML-T5 not only works as a core module
for WebAgent but also achieves strong results by itself on the web-based tasks. On MiniWoB++ (Liu
et al., 2018; Shi et al., 2017), HTML-T5 achieves 18.7% higher success than the previous language
model agent (Gur et al., 2022) while also outperforming competitive baselines, such as naive local-global attention models (Guo et al., 2022) and their instruction-finetuned variants (Chung et al., 2022).
On Mind2Web (Deng et al., 2023), an offline task planning dataset, HTML-T5 achieves SoTA
performance, outperforming Synapse (Zheng et al., 2023) with GPT-3.5 and MindAct with Flan-T5-XL and
GPT-4 (OpenAI, 2023). In summary, our key contributions are:
• We propose WebAgent, an integration of two modular LLMs under self-supervision for real-world
web automation. The domain-expert language model handles planning and HTML summarization,
and a generalist language model generates executable Python programs.
• We newly introduce HTML-T5, a language model with a local-global attention mechanism that
is pre-trained with a mixture of long-span denoising objectives on a large-scale HTML corpus,
curated from CommonCrawl, to capture the syntax and semantics of HTML better.
• WebAgent notably improves the success rate by over 50% on real websites. When fine-tuned on
downstream demonstrations, HTML-T5 itself outperforms the prior language model agent by 18.7%
on MiniWoB++, and achieves SoTA performance on Mind2Web, even surpassing GPT-4.
2 RELATED WORKS
[Figure 3 diagram: the finetuned HTML-T5 encoder and decoder take the navigation instruction, history, and HTML as inputs and produce a sub-instruction and HTML snippets; the frozen Flan-U-PaLM decoder takes these with a few-shot prompt and produces the web automation program.]
Figure 3: WebAgent is a combination of LLMs: HTML-T5 for planning and summarization, and
Flan-U-PaLM for grounded program synthesis. It is better suited for real-world tasks, which involve
open-domain action spaces, complex natural language instructions, and long HTML documents. See Appendix D for examples.
Web Automation Web automation is a sequential decision making task where agents manipulate browsers
following given instructions (Shi et al., 2017), such
as form filling (Diaz et al., 2013) or information retrieval (Adolphs et al., 2022) through the sequence of
computer actions (Li et al., 2020; Mazumder & Riva,
2020; Shvo et al., 2021). Prior works have realized the
web automation via reinforcement learning (Gur et al.,
2019; Humphreys et al., 2022; Jia et al., 2019; Shaw
et al., 2023), finetuned (Furuta et al., 2023; Gur et al.,
2022) or prompted LLMs (Kim et al., 2023; Sun et al.,
2023; Yao et al., 2022b; Zheng et al., 2023) on the simulated websites (Shi et al., 2017; Toyama et al., 2021;
Yao et al., 2022a). However, there are still huge gaps
between simplified simulators and real web environments; for instance, the average tokens for HTML
pages are about 15 times larger (Figure 2), and pre-defined action space for specific websites is a
strong assumption that may harm the generalization to out-of-distribution web pages or instructions.
MindAct (Deng et al., 2023) could be the most relevant work, where a finetuned language model
summarizes the raw HTML document into task-relevant snippets, and another model predicts the
web actions in a multi-choice QA format. While MindAct also combines several language models, it
has just adopted DeBERTa (He et al., 2021) and Flan-T5 (Chung et al., 2022) for summarization and
actor modules, and evaluated them on an offline dataset. In contrast, we design HTML-T5, specialized
for web-based tasks, to handle long HTML documents. WebAgent leverages HTML-T5 finetuned
with self-experience for summarization and planning, and Flan-U-PaLM as a capable programmer,
which enables it to generate open-ended web actions and to act on online real-world websites.
Program Synthesis In addition to common LLMs (Brown et al., 2020; Chowdhery et al., 2022;
Touvron et al., 2023), several works have proposed programming-focused language models (Chen
et al., 2021a; Feng et al., 2020; Li et al., 2022; Wang et al., 2021) and their benchmarks (Austin
et al., 2021; Hendrycks et al., 2021a; Lu et al., 2021). Another line of work has investigated the
tool augmentation of LLMs (Parisi et al., 2022) by decoding API calls (Schick et al., 2023) or
Python snippets to be parsed with the interpreter (Gao et al., 2023). Most works deal with the
program synthesis on the static dataset, except for the attempts in robotics (Liang et al., 2023) and
game (Trivedi et al., 2022; Wang et al., 2023a), where LLMs output Python or JavaScript snippets to
command the agents. Similarly, we leverage the ability of code generation as an open-ended action
space for web-based agents to manipulate the real website, and demonstrate that LLMs can sequentially
decode Python Selenium code conditioned on the given sub-instructions and HTML in the prompts.
See extended related works on document understanding and LLM for task planning in Appendix B.
3 WEBAGENT
WebAgent is a new architecture that combines two LLMs to achieve efficient real-world web automation. HTML-T5, a domain-expert LLM, is responsible for predicting the next sub-instruction
(planning) and generating related HTML snippets (summarization). Flan-U-PaLM (540B) (Chowdhery et al., 2022; Chung et al., 2022) is prompted to generate executable Python programs based on
the planning and summarization provided by HTML-T5 (Figure 3). This modular two-stage approach
enables WebAgent to effectively navigate and process HTML documents.
[Figure 4 diagram: (left) local and global attention mechanism in the encoder Transformer of an encoder-decoder model, combining local attention with transient global attention; (right) HTML-denoising examples, where span length = 3 yields noisy spans (</, id=, In, id=", ">, for, type, =", div>, ...) and span length = 8 yields meaningful spans (<form class=", type="submit">, id="uName, …).]
Figure 4: HTML-T5 consists of (1) local and global attention mechanisms (Ainslie et al., 2020; Guo et al., 2022)
and (2) a mixture of denoising objectives (Tay et al., 2022) with longer-span corruption on large-scale HTML
corpus. The local and global attention mechanisms are suitable for the hierarchical tree structures of HTML
documents. Because of the sparsity of content tokens in HTML, short mean span length (e.g. µ = 3), often
used in prior works (Raffel et al., 2020), only masks less meaningful chunks. Employing longer span length (e.g.
µ = 8) helps pre-trained language models to capture the syntax and semantics of HTML better.
Workflow Users initiate natural language interactions with a clear intent, such as apartment searching.
Upon receiving the initial request, HTML-T5 formulates a “go to <URL>” sub-instruction, triggering
Flan-U-PaLM to generate a corresponding Python program that navigates to the specified website.
The raw HTML content of the newly opened page is extracted and fed into HTML-T5 along with the
user’s instruction and previous planning steps. This information is utilized to predict the next sub-instruction
and identify relevant reference IDs for extractive HTML summarization. Flan-U-PaLM,
in turn, generates a Python program based on these sub-instructions and combined HTML snippets.
This iterative process of planning, summarization, and program synthesis continues until a designated
end-of-episode sub-instruction is predicted or the maximum number of iterations is reached.
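The iterative workflow can be sketched as a short control loop. This is a minimal illustration rather than the authors' implementation: the two models and the browser are passed in as plain callables, and the names run_webagent, END_OF_EPISODE, and the default max_steps = 30 are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical sentinel: the paper does not give the exact end-of-episode string.
END_OF_EPISODE = "end of episode"


@dataclass
class Episode:
    instruction: str
    history: List[str] = field(default_factory=list)   # past sub-instructions
    programs: List[str] = field(default_factory=list)  # executed programs


def run_webagent(
    instruction: str,
    get_html: Callable[[], str],
    plan_and_summarize: Callable[[str, List[str], str], Tuple[str, str]],
    synthesize_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 30,  # assumed limit, not taken from the paper
) -> Episode:
    """Plan, summarize, and program until the episode ends or max_steps is hit."""
    episode = Episode(instruction)
    html = ""  # no page is open before the first "go to <URL>" step
    for _ in range(max_steps):
        # Role of HTML-T5: next sub-instruction plus task-relevant HTML snippet.
        sub_instruction, snippet = plan_and_summarize(instruction, episode.history, html)
        if sub_instruction == END_OF_EPISODE:
            break
        # Role of Flan-U-PaLM: a Python program grounded on sub-instruction and snippet.
        program = synthesize_program(sub_instruction, snippet)
        execute(program)
        episode.history.append(sub_instruction)
        episode.programs.append(program)
        html = get_html()  # raw HTML of the page after acting
    return episode
```

Passing the models as callables keeps the loop testable with stubs; in a real system plan_and_summarize would wrap the finetuned planner and summarizer, and execute would run the program against a browser session.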
3.1 HTML-T5
Prior research has shown that general-purpose LLMs, such as T5 (Raffel et al., 2020), Flan-T5 (Chung
et al., 2022), and InstructGPT (Ouyang et al., 2022), can effectively navigate web environments
(Furuta et al., 2023; Gur et al., 2022; Kim et al., 2023). However, unlike specialist transformer
models (Li et al., 2021b; Wang et al., 2022a; Zhao et al., 2022), these general-purpose LLMs do
not fully utilize the HTML-specific information that could otherwise lead to better understanding of
HTML content. To address this limitation, we introduce HTML-T5, a pre-trained encoder-decoder
language model specifically designed for HTML-based web automation tasks. HTML-T5 carefully
merges the generalist and specialist characteristics of language models. It processes HTML in a
text-to-text manner and employs local and global attention mechanisms (Ainslie et al., 2020) in the
encoder to capture the hierarchical structure of long HTML inputs. HTML-T5 is pre-trained on
a large-scale HTML corpus curated from CommonCrawl using a mixture of long-span denoising
objectives (Tay et al., 2022), and is then finetuned for each downstream task. For WebAgent, we
employ the self-experience supervision approach to align the model with real websites.
Model Architecture Unlike natural language, HTML documents possess an explicit hierarchical
structure. This structure comprises elements such as <input>, <label>, and <button>, along
with their associated attributes like class, label, and id. These elements are defined locally and
combined hierarchically to create HTML documents. To model this inherent hierarchy, we replace the
common dense attention (Vaswani et al., 2017) with local and global attention mechanisms (Ainslie
et al., 2020). Local attention restricts each token to only attend to neighboring tokens within a window.
Additionally, transient global attention allows each token to attend to tokens beyond its immediate
window. This is achieved through the aggregation and normalization of token representations within
each window, resulting in a global memory representation. Figure 4 describes the concepts of HTML-T5;
leaf elements in HTML (green) could be processed by local attention, and internal elements
(purple) could be compressed into transient global attention, which naturally fits the hierarchical
structure of HTML. Following LongT5 (Guo et al., 2022), we use dense attention in the decoder.
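The two encoder attention patterns can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the window radius and block size are free parameters chosen for illustration, and rescaling to unit root-mean-square stands in for the normalization layer a real model would use.

```python
import math
from typing import List


def local_attention_mask(seq_len: int, radius: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j (|i - j| <= radius)."""
    return [[abs(i - j) <= radius for j in range(seq_len)] for i in range(seq_len)]


def transient_global_tokens(
    embeddings: List[List[float]], block_size: int
) -> List[List[float]]:
    """Aggregate each block of token vectors into one global memory vector.

    Each block is summed and then rescaled to unit root-mean-square
    (an assumed, simplified normalization).
    """
    global_tokens = []
    for start in range(0, len(embeddings), block_size):
        block = embeddings[start:start + block_size]
        summed = [sum(column) for column in zip(*block)]
        rms = math.sqrt(sum(v * v for v in summed) / len(summed)) or 1.0
        global_tokens.append([v / rms for v in summed])
    return global_tokens
```

In this picture every input token attends to its neighbors under the local mask and, in addition, to all global tokens, so the cost grows with the window size and the number of blocks rather than with the square of the sequence length.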
                 Modules      real-estate       social-media      map               Error Ratio (%)
                 Plan  Sum    Success  Score    Success  Score    Success  Score    Program         Plan            Sum
Flan-U-PaLM      ✗     ✗      10.0     55.3     20.0     25.0     10.0     51.3     36 / 88 / 11    38 / 0 / 78     26 / 12 / 11
Flan-U-PaLM+P    ✓     ✗      50.0     79.5     20.0     38.3     30.0     73.8     39 / 65 / 14    56 / 30 / 29    5 / 5 / 57
Flan-U-PaLM+S    ✗     ✓      0.0      45.7     25.0     62.1     15.0     46.3     30 / 67 / 0     40 / 13 / 100   30 / 20 / 0
WebAgent         ✓     ✓      65.0     87.6     70.0     85.8     80.0     93.8     20 / 33 / 25    70 / 50 / 50    10 / 17 / 25
Table 1: Success rate of real-world web automation on real estate, social media and map websites. The score
stands for the percentage of covered attributes specified in given instructions. WebAgent, with language model
modules for planning and summarization, achieves the best success (65%, 70%, 80%, respectively), surpassing
other baselines, such as a single Flan-U-PaLM, that with a planning language model (Flan-U-PaLM+P), and
that with a summarization language model (Flan-U-PaLM+S). Without language model modules, prompted
Flan-U-PaLM plans in an open-loop manner (Plan: ✗) and regular-expression-based retrieval summarizes
HTML inputs (Sum: ✗). The results imply that self-experience supervision notably improves the performance,
and task planning should be learned by finetuning domain language models for closed-loop planning, rather
than by prompting a single LLM for open-loop planning. The error analysis describes the ratio across three types
of errors in (real-estate) / (social-media) / (map) domains, which also points out that a more adaptive
planner to decompose the given instructions would contribute to further improvements of WebAgent.
Pre-Training with Mixture of Long-Span Denoising Our pre-training approach for HTML-T5
utilizes a span denoising objective. This involves randomly masking spans of tokens within an HTML
document, with span lengths drawn from a Gaussian distribution with a mean of µ. The objective
is then to predict the masked spans using the remaining tokens in the HTML document (Raffel
et al., 2020; Tay et al., 2022; Ainslie et al., 2023). While a span length of µ = 3 is commonly used,
such short spans often mask less meaningful chunks in HTML documents, such as </, id=, or
"> (Figure 4), where the signal-to-noise ratio can be significantly lower than natural language text. In
contrast, longer spans can contain more semantically meaningful chunks, such as <form class="
or type="submit">. Our empirical findings indicate that setting µ ∈ {8, 64} yields the optimal
mixture for HTML documents (Section 4.2).
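The span-denoising objective can be sketched in a few lines of Python. This is an illustrative approximation, not the pre-training code: the standard deviation of the Gaussian (span_std), the uniform choice between the two means, and the T5-style <extra_id_k> sentinel names are assumptions made here, while the 0.15 default follows the masking rate the paper reports.

```python
import random
from typing import List, Optional, Sequence, Tuple


def corrupt_spans(
    tokens: Sequence[str],
    mean_span: float,
    corruption_rate: float = 0.15,
    span_std: float = 1.0,  # assumed; the paper only states the mean
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Mask random spans and return (inputs, targets) with sentinel tokens."""
    rng = rng if rng is not None else random.Random(0)
    n = len(tokens)
    if n == 0:
        return [], []
    budget = max(1, round(n * corruption_rate))  # number of tokens to mask
    masked = [False] * n
    attempts = 0
    while budget > 0 and attempts < 10 * n:
        attempts += 1
        length = min(budget, max(1, round(rng.gauss(mean_span, span_std))))
        start = rng.randrange(n - length + 1)
        # Reject candidates that overlap or touch an already masked span.
        lo, hi = max(0, start - 1), min(n, start + length + 1)
        if any(masked[lo:hi]):
            continue
        masked[start:start + length] = [True] * length
        budget -= length

    inputs: List[str] = []
    targets: List[str] = []
    sentinel = 0
    i = 0
    while i < n:
        if masked[i]:
            tag = f"<extra_id_{sentinel}>"
            sentinel += 1
            inputs.append(tag)   # the span collapses to one sentinel in the input
            targets.append(tag)  # the target lists the sentinel, then the span
            while i < n and masked[i]:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def corrupt_with_mixture(
    tokens: Sequence[str],
    mean_spans: Sequence[float] = (8, 64),
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Pick one mean span length per example (assumed uniform) and corrupt."""
    rng = rng if rng is not None else random.Random(0)
    return corrupt_spans(tokens, rng.choice(list(mean_spans)), rng=rng)
```

With a longer mean span, a single masked region is more likely to cover a whole attribute or tag such as type="submit"> instead of a fragment such as id=, which is the intuition behind the µ ∈ {8, 64} mixture.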
We adopt 4096 input sequence length and 910 output sequence length during pre-training. In total,
15% of input tokens are randomly masked in the denoising objective. For the pre-training dataset, we
collect 100 WARC files (April 2019) from the CommonCrawl corpus and remove the non-Unicode or
alphanumeric-only HTML documents. We then extract subtrees around <label> elements that have
a special attribute called for that associates the corresponding label with a unique input element in
the same HTML document. This pre-processing step improves the quality of the pre-training corpus
by focusing only on HTML that is relevant for instruction following and grounding. Our final dataset
has 3.41M examples. We pre-train HTML-T5 for 100K iterations following the practice in other T5
models (Chung et al., 2022; Lester et al., 2021). See Appendix C for further details.
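The first part of this filtering step, finding <label> elements whose for attribute points at an element that exists in the same document, can be sketched with the Python standard library. The class and function names below are hypothetical, and the sketch stops at identifying the label targets; how the surrounding subtrees are cut out is not specified in this section and is left out.

```python
from html.parser import HTMLParser
from typing import List, Set


class LabelForCollector(HTMLParser):
    """Collects `for` targets of <label> elements and all ids seen in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.label_targets: List[str] = []
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        if tag == "label" and attributes.get("for"):
            self.label_targets.append(attributes["for"])


def grounded_label_targets(html: str) -> List[str]:
    """Return `for` values that resolve to an element id in the same document."""
    parser = LabelForCollector()
    parser.feed(html)
    parser.close()
    return [target for target in parser.label_targets if target in parser.ids]
```

A document for which this list is empty would contribute no label-centered subtrees under this reading of the pre-processing step.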
3.2 SELF-EXPERIENCE SUPERVISION FOR ALIGNMENT WITH REAL WEBSITES
Gathering example demonstrations for LLMs to understand websites poses a significant obstacle in
real-world web automation. While humans can effortlessly execute instruction following on actual
websites, manually annotating every planning, summarization, and program synthesis step as detailed
above is impractical. To address this issue, we propose self-experience supervision, a semi-supervised
approach that necessitates minimal human involvement. In this method, manually curated scripts
generate planning and summarization steps, while Flan-U-PaLM is tasked with generating Python
programs. Our WebAgent aligns domain-specific language models, such as HTML-T5, with these
self-gathered real-world experiences through fine-tuning (Wang et al., 2022b). This enables the
generalization and alignment of agents to complex real-world tasks.
Instruction Templates We maintain a collection of instruction templates that incorporate placeholders such as “Show me the way from <start> to <goal> by <n-th> <transportation> at
map website”. We sample instructions by randomly assigning values to placeholders from pre-defined
key-value pairs.
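Instruction sampling from a template can be sketched as follows. The template string is the one quoted above; the value pools are hypothetical examples (only San Jose, Mountain View, 2nd, and Cycling appear in the paper's map example), since the actual key-value pairs are not listed.

```python
import random
import re
from typing import Dict, List

TEMPLATE = (
    "Show me the way from <start> to <goal> by <n-th> <transportation> "
    "at map website"
)

# Hypothetical value pools; the paper does not list the real key-value pairs.
VALUES: Dict[str, List[str]] = {
    "start": ["San Jose", "Palo Alto"],
    "goal": ["Mountain View", "Sunnyvale"],
    "n-th": ["1st", "2nd"],
    "transportation": ["Cycling", "Walking"],
}


def sample_instruction(
    template: str, values: Dict[str, List[str]], rng: random.Random
) -> str:
    """Replace every <placeholder> with a randomly chosen value for its key."""
    return re.sub(
        r"<([^<>]+)>", lambda match: rng.choice(values[match.group(1)]), template
    )
```

For example, sample_instruction(TEMPLATE, VALUES, random.Random(0)) returns one fully instantiated instruction; a placeholder with no entry in the value pools raises a KeyError, which surfaces template mistakes early.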
Scripted Planning and Prompted Programming We employ a rule-based parser to decompose
instructions into sequences of sub-instructions; corresponding reference IDs are retrieved from
HTML using regular expressions. At each step of the process, Flan-U-PaLM is provided with the
sub-instruction and the associated HTML snippets to generate navigation programs that are executed
through Selenium WebDriver. The success of recorded demonstrations varies, and automating
success criteria for real-world tasks remains challenging. To refine the learning experience, we utilize
environmental feedback to eliminate critical failures, such as program execution errors, retriever
errors, and clearly erroneous URL prefixes (Ni et al., 2023).
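The filtering of recorded demonstrations can be sketched as a simple predicate over steps. The record layout is an assumption made for illustration: the field names, the reading of a retriever error as an empty list of reference IDs, and the prefix check on the URL are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecordedStep:
    url: str                   # page URL after the step was executed
    program_raised: bool       # True if the generated program raised an error
    retrieved_refs: List[str]  # reference IDs returned by the retriever


def is_critical_failure(step: RecordedStep, allowed_prefix: str) -> bool:
    """Flag execution errors, empty retrievals, and off-site URLs."""
    return (
        step.program_raised
        or not step.retrieved_refs
        or not step.url.startswith(allowed_prefix)
    )


def filter_episodes(
    episodes: List[List[RecordedStep]], allowed_prefix: str
) -> List[List[RecordedStep]]:
    """Keep only episodes in which no step is a critical failure."""
    return [
        episode
        for episode in episodes
        if not any(is_critical_failure(step, allowed_prefix) for step in episode)
    ]
```

Such a filter only removes clearly broken episodes; it does not verify that the remaining ones actually satisfy the instruction, which matches the remark that automating success criteria remains challenging.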
map: Show me the way from San Jose to Mountain View by 2nd Cycling at map website?
# Type Mountain View into search
driver.find_element(By.CSS_SELECTOR,"...").clear()
driver.find_element(
By.CSS_SELECTOR,"..."
).send_keys("Mountain View")
# Type San Jose into starting point
driver.find_element(By.CSS_SELECTOR,"...").clear()
driver.find_element(
By.CSS_SELECTOR,"...").send_keys("San Jose")
# Click Cycling radio button
driver.find_element(
By.CSS_SELECTOR,"#Cycling").click()
# Click 2nd trip
driver.find_element(By.CSS_SELECTOR,"#trip1").click()
Figure 5: Example episodes of real-world web automation in map domain. Considering the given instruction and
HTML, WebAgent predicts the next sub-instruction and task-relevant snippet, and then synthesizes the Python
script (gray), while treating the sub-instruction as a comment in the script. See Appendix G for extended figure.
Finetuning for Planning and Summarization HTML-T5, a core component of WebAgent, is
fine-tuned using self-experience demonstrations gathered through instruction sampling, scripted
planning, and prompted program synthesis, as detailed earlier. It utilizes task instructions (e.g. please
search 2 bedroom and 2+ bathroom houses in new york, ny with a max price of $7500 on real estate
website), sub-instruction histories (e.g. go to real estate website, type in new york into search, click
on search, click on price, click on max rent), and raw HTML as inputs. Subsequently, it generates the
next sub-instruction (e.g. type in 7500 into max rent) and extracts the relevant data-ref attributes
used for retrieving HTML snippets. Section 4.1 demonstrates the significance of integrating HTML
summarization into sub-instruction prediction for enhancing real-world web automation performance.
3.3 GROUNDED PROGRAM SYNTHESIS
Web automation on real-world websites faces challenges due to the open-ended action spaces, unlike
simplified simulators (Shi et al., 2017; Yao et al., 2022a). In contrast to previous approaches (Gur
et al., 2019; Humphreys et al., 2022; Jia et al., 2019; Liu et al., 2018), real-world web agents cannot
pre-define a categorical action space to specify the interactive elements on the websites. To address
this open-domain challenge, we introduce the act via programming paradigm in web automation by
utilizing the conditional code generation capabilities of LLMs (Chen et al., 2021a; Liang et al., 2023).
Provided with few-shot generic examples (such as selecting checkboxes, entering text into inputs,
clicking on options, and scrolling) for program generation, the next sub-instruction, and the
extracted HTML snippet from HTML-T5, Flan-U-PaLM (Chowdhery et al., 2022; Chung et al., 2022)
decodes a Python program (Figure 3) executable with Selenium WebDriver, a library for browser
automation. This conditional program synthesis requires LLMs to not only generate code to follow
natural language instructions but also understand the semantics and functionality of HTML elements.
We provide several Python snippet examples generated by Flan-U-PaLM as follows (sub-instructions
are treated as comments in the script):
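# Setup assumed by the snippets below (not part of the generated output)
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # any Selenium WebDriver instance works

# Type in walnut creek, ca into search
driver.find_element(By.CSS_SELECTOR, '[data-ref="175"]').clear()
driver.find_element(By.CSS_SELECTOR, '[data-ref="175"]').send_keys("walnut creek, ca")

# Submit the search
driver.find_element(By.CSS_SELECTOR, '[data-ref="175"]').submit()

# Click on the apartments
driver.find_element(By.CSS_SELECTOR, '[data-ref="572"]').click()

# Scroll down housing type by 200px
# (getScrollParent is not a built-in browser function; it must already be defined on the page)
driver.execute_script('getScrollParent(document.querySelector("#type-of-housing")).scrollBy({top: 200})')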
4 EXPERIMENTAL RESULTS
To study how a modular combination of LLMs under self-supervision enables real-world web automation by overcoming open-endedness and long context documents, we execute instruction-following
tasks on real websites (Section 4.1). In Appendix E, we also test WebAgent on WebSRC (Chen
et al., 2021b), a static HTML comprehension benchmark, compared to prior transformer models
specialized for structured documents (Li et al., 2021b; Zhao et al., 2022). In addition, we quantify
the performance of HTML-T5 itself on a simulated web benchmark, MiniWoB++, and an offline task
planning benchmark, Mind2Web (Section 4.2).
(Left)
Architectures    Attention Type    L = 2048    L = 4096
Flan-T5-Base     Dense             34.0%       35.3%
Long-T5-Base     Local             43.4%       44.0%
Long-T5-Base     Local & Global    53.1%       53.6%
(Right)
Span Length µ          real-estate    MiniWoB++
(no HTML-denoising)    78.07          53.8%
3,8,64,Prefix          80.56          55.2%
3,8,64                 80.56          55.4%
8,64                   82.46          57.0%
8,32,64                82.16          55.6%
8,64,96                81.29          53.6%
16,64                  79.97          55.2%
Table 2: (Left) Architecture comparison on MiniWoB++ 12K dataset (Liu et al., 2018) with average success rate
over 56 tasks. Local and global attention matches the hierarchical tree structure of HTML and improves
the success rate by over 18% compared to the instruction-finetuned dense attention (Chung et al., 2022; Furuta
et al., 2023). (Right) HTML-denoising comparison with different mixtures of span length (Raffel et al., 2020;
Tay et al., 2022). We use LongT5-Base models for pre-training. HTML-denoising generally improves the
performance on offline task planning on the real estate website and the MiniWoB benchmark. In particular, using longer
span lengths (µ ∈ {8, 64}) outperforms other choices, including the popular configuration in the natural language
domain (µ ∈ {3, 8, 64} + Prefix LM objective); longer spans reduce the less meaningful predictions from shorter
spans (e.g. µ = 3) and inject the structural bias of HTML better.
4.1 REAL-WORLD WEB AUTOMATION
Evaluation Methodology We first evaluate the real-world navigation performance of WebAgent,
under human supervision, on a real estate website (a platform for housing), a social media website
(a network of communities), and a map website. These three websites have different properties.
real-estate requires long-horizon planning (about 20 steps per episode) for complex form-filling
with a few page transitions (at least 2 pages), and social-media needs shorter plans (about
10 steps per episode) with many page transitions (at least 4 pages) by selecting appropriate hyperlinks
on the page. map is the easiest domain with shorter plans and a few page transitions. WebAgent
receives natural language instructions (e.g. Can you search for a studio bedroom, 1+ bathroom
apartments in oroville, ca for corporate housing on real estate website?, or Could you present the
most new thread of Python community filtered by Tutorial tag on social media website?), and acts via
planning, summarizing by HTML-T5, and then programming by Flan-U-PaLM (Figure 5). Through
the self-experience supervision process, we curate 260 episodes on real estate website, 230 episodes
on social media website, and 410 episodes on map website to finetune HTML-T5.
We prepare 20 different natural language instructions (see Appendix F for the full list), and measure
the success rate and score for the evaluation. The score represents the percentage of required attributes
covered during the episode (Yao et al., 2022a); for instance, (1) apartments for (2) corporate housing
with (3) studio bedroom and (4) 1+ bathroom located in (5) oroville, ca, can be specified in the
instruction. When the agent finds housing that satisfies (1), (2), and (5) but not (3) and (4), the
score is 60 (= 100 × 3/5). If the agent achieves a score of 100, the episode is marked as a success.
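The score and success criterion can be written as a small function. This is a sketch of the arithmetic only: representing attributes as strings and the name episode_score are assumptions made here, and deciding which attributes an episode actually satisfied was done under human supervision in the paper.

```python
from typing import Iterable, Tuple


def episode_score(required: Iterable[str], satisfied: Iterable[str]) -> Tuple[float, bool]:
    """Return (score, success): the percentage of required attributes covered,
    and whether all of them were covered."""
    required_set = set(required)
    if not required_set:
        raise ValueError("an instruction must specify at least one attribute")
    covered = required_set & set(satisfied)
    score = 100.0 * len(covered) / len(required_set)
    return score, len(covered) == len(required_set)


# The example from the text: five required attributes, three of them satisfied.
required = ["apartments", "corporate housing", "studio bedroom", "1+ bathroom", "oroville, ca"]
satisfied = ["apartments", "corporate housing", "oroville, ca"]
print(episode_score(required, satisfied))  # (60.0, False)
```

The success rate reported per website is then the fraction of evaluation episodes for which the second returned value is True.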
Results For comparison, we prepare three baselines, consisting of language model modules and a
single LLM conditioned on different prompts per role, such as Flan-U-PaLM (Chung et al., 2022),
that with a planning language model (Flan-U-PaLM+P), and that with a summarization language
model (Flan-U-PaLM+S). If they do not use language model modules, prompted Flan-U-PaLM plans
in an open-loop manner (Plan: ✗), and regular-expression-based retrieval summarizes given raw
HTML (Sum: ✗). Table 1 shows that by leveraging planning and summarization language model
modules, WebAgent achieves the best results: 65% success and 87.6% score on real-estate, 70% success
and 85.8% score on social-media, and 80% success and 93.8% score on map, significantly
outperforming a single Flan-U-PaLM, or Flan-U-PaLM with partial language model modules (most of those achieve
about 10-30% success). This result suggests that self-experience supervision notably improves
the performance, and closed-loop planning grounded on HTML observations via finetuned domain
language models is more suitable for open-ended web automation than open-loop planning with
few-shot LLMs. This trend is remarkable in real-estate (even Flan-U-PaLM+P achieves 50%
success), where the longer planning horizon is needed to fulfill instructions. We also observe that
coupling sub-instruction prediction with HTML summarization in language model modules plays a
critical role in task success. The development of more capable planning modules to decompose the
given instructions adaptively and accurately could help WebAgent improve the performance further.
                                Cross-Task                        Cross-Website                     Cross-Domain
Model                   Train   Ele. Acc  Op. F1  Step SR  SR     Ele. Acc  Op. F1  Step SR  SR     Ele. Acc  Op. F1  Step SR  SR
Synapse (GPT-3.5)       ICL     34.4      –       30.6     2.0    28.8      –       23.4     1.1    29.4      –       25.9     1.6
MindAct (Flan-T5-XL)    SL      55.1      75.7    52.0     5.2    42.0      65.2    38.9     5.1    42.1      66.5    39.6     2.9
MindAct (GPT-4)         ICL     41.6      60.6    36.2     2.0    35.8      51.1    30.1     2.0    37.1      46.5    26.4     2.0
HTML-T5-XL (ours)       SL      60.6      81.7    57.8     10.3   47.6      71.9    42.9     5.6    50.2      74.9    48.3     5.1
Table 4: Offline action prediction performance in Mind2Web dataset. We leverage the cached candidate
generation results and direct QA formulation by following Deng et al. (2023). HTML-T5 significantly
outperforms MindAct with Flan-T5 or GPT-4, and Synapse (Zheng et al., 2023) with GPT-3.5, across
task/website/domain generalization in terms of all the metrics (element accuracy, operation F1, and
success rates).
Error Analysis We also analyze the causes of failure by categorizing them into programming,
planning, and summarization errors (Table 1). A programming error produces code that does not
satisfy the given sub-instructions or HTML snippet. A planning error predicts sub-instructions that
conflict with the user instructions, and a summarization error fails to extract the relevant HTML
snippets for the given sub-instructions.
From the website perspective, the failures on real-estate concentrate in planning because of its
long-horizon nature. map also fails in planning when the starting point and destination are
confused. In contrast, social-media tends to fail in programming due to ambiguous sub-instructions
or summarization that includes redundant hyperlinks, which results in navigating to wrong pages or
clicking non-executable elements. From the method perspective, WebAgent often fails in planning by
predicting incorrect sub-instructions (for instance, in real-estate, WebAgent generates incorrect
plans in 70% of failure episodes), while other baselines fail more often in the programming or
summarization steps. This observation indicates that, through the self-experience supervision, the
ratio of programming and summarization errors has decreased, while the fundamental difficulty of
planning, which requires consistent and accurate prediction over a long horizon without error
accumulation, still remains.
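The effect of error accumulation over a long horizon can be made concrete with a back-of-the-envelope calculation. The sketch below rests on a simplifying assumption that is ours and not the paper's: each sub-instruction is predicted correctly with the same probability, independently of the others, and an episode succeeds only if every step is correct. The 95% per-step accuracy is an illustrative value, not a measured one.

```python
def episode_success_probability(per_step_accuracy: float, horizon: int) -> float:
    """Probability that all `horizon` steps are correct, assuming independent steps."""
    return per_step_accuracy ** horizon


# Illustrative per-step accuracy; not a number reported in the paper.
for horizon in (3, 6, 10):
    probability = episode_success_probability(0.95, horizon)
    print(f"horizon={horizon:2d}  success probability={probability:.3f}")
```

Under this assumption, success decays geometrically with the number of steps, which is consistent with planning errors dominating on the website that needs the longest plans.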
4.2 ABLATION OF HTML-T5
Models                      Data    Success   Diff. vs. LongT5
SoTA (Zheng et al., 2023)   –       99.2%     –
CC-Net                      2.4M    32.0%     –
WebN-T5-XL                  12K     48.4%     –
LongT5-Base                 12K     53.8%     0.0
LongT5-Large                12K     56.3%     0.0
LongT5-XL                   12K     60.4%     0.0
Flan-LongT5-Base            12K     54.1%     +0.3
Flan-LongT5-Large           12K     56.1%     -0.2
Flan-LongT5-XL              12K     61.1%     +0.7
HTML-T5-Base (ours)         12K     57.0%     +3.2
HTML-T5-Large (ours)        12K     60.8%     +4.5
HTML-T5-XL (ours)           12K     67.1%     +6.7
Flan-T5-XL                  347K    75.5%     –
Flan-T5-XXL                 347K    79.0%     –
HTML-T5-XL (ours)           347K    85.6%     –
Table 3: Average success rate of MiniWoB++ with 56 tasks. We use 12K demonstrations and compare
HTML-T5 among supervised-finetuned methods. HTML-T5-XL outperforms WebN-T5-XL (Gur et al., 2022),
the prior best method, by 18.7%. HTML-denoising also yields a better success rate than instruction
tuning. Finetuned HTML-T5 with 347K episodes (Furuta et al., 2023) outperforms Flan-T5-XXL (11B
parameters) even with 3B parameters, which gets closer to SoTA with GPT-3.5. See Appendix J for the
detailed results.
In addition to the evaluation of the full WebAgent system, we extensively examine HTML-T5 in terms
of (i) its generalization to other websites with Mind2Web (Deng et al., 2023), (ii) its performance
on MiniWoB++, a standard web automation benchmark (Liu et al., 2018; Shi et al., 2017), and
(iii) its architecture and pre-training objective. We adopt 16K tokens for the context window
unless otherwise mentioned. In Appendix H, we present results on offline task planning and
description generation (Gur et al., 2022) to test HTML understanding on a static dataset.
Mind2Web Mind2Web (Deng et al., 2023) is an action-annotated real-world dataset with over 2K
instructions collected from 137 websites. It provides action prediction tasks that measure the
generalization of LLMs across tasks, websites, and their domains (e.g. travel, shopping). Similar
to the real-world evaluation, the input is a set of HTML snippets, a task instruction, and an action
history. The output comprises a target element to interact with, along with the operation, such as
click, type, or select an option. We finetune HTML-T5-XL with the training dataset. The performance
is evaluated with element accuracy, operation F1, and step success rate, which requires both the
element and the operation to be correct. Table 4 reveals that HTML-T5 significantly outperforms
baselines with Flan-T5-XL or GPT-4 (OpenAI, 2023) across task/website/domain generalization,
increasing element accuracy by 5-8%, operation F1 by 6-8%, and step success rate by 4-8%. This
highlights that HTML-T5 can handle real-world web automation tasks better and shows generalization
beyond our real-world evaluation with 3 websites.
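The three step-level metrics can be sketched as follows. This is a simplified illustration rather than the benchmark's official scorer: the record fields (`pred_element`, `gold_elements`, `pred_op`, `gold_op`), the element identifiers, and the whitespace tokenization used for operation F1 are assumptions made for this example.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between two operation strings (whitespace tokens, case-insensitive)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_steps(steps):
    """Aggregate element accuracy, operation F1, and step success over one episode."""
    if not steps:
        raise ValueError("at least one step is required")
    element_hits = [s["pred_element"] in s["gold_elements"] for s in steps]
    op_f1s = [token_f1(s["pred_op"], s["gold_op"]) for s in steps]
    # A step counts as successful only if the element AND the operation are right.
    step_hits = [hit and f1 == 1.0 for hit, f1 in zip(element_hits, op_f1s)]
    n = len(steps)
    return {
        "element_accuracy": sum(element_hits) / n,
        "operation_f1": sum(op_f1s) / n,
        "step_success_rate": sum(step_hits) / n,
        "episode_success": all(step_hits),
    }


# Hypothetical three-step episode.
EXAMPLE_STEPS = [
    {"pred_element": "btn_search", "gold_elements": {"btn_search"},
     "pred_op": "CLICK", "gold_op": "CLICK"},
    {"pred_element": "input_city", "gold_elements": {"input_city"},
     "pred_op": "TYPE new york", "gold_op": "TYPE new york city"},
    {"pred_element": "link_3", "gold_elements": {"link_7"},
     "pred_op": "CLICK", "gold_op": "CLICK"},
]

scores = score_steps(EXAMPLE_STEPS)
print(scores)
```

In the hypothetical episode, the second step picks the right element but a partially wrong operation, and the third step picks the wrong element with the right operation, so only the first step counts towards step success.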
MiniWoB++ We here evaluate HTML-T5 on 56 simulated tasks in MiniWoB++ using 100 evaluation
episodes per task. Inputs are analogous to the real-world evaluation and consist of HTML documents,
while outputs adhere to a format pre-defined by the simulator, such as click(ref = X). We finetune
HTML-T5 with 12K human demonstrations (Liu et al., 2018), and compare the average success
rate to prior supervised-learned agents (Gur et al., 2022; Humphreys et al., 2022), LongT5, and its
instruction-finetuned variants (Chung et al., 2022)¹. Table 3 shows that HTML-T5-XL significantly
outperforms WebN-T5, the prior best model, by 18.7%. Notably, we demonstrate that HTML-denoising
consistently improves the performance on top of LongT5 in all the model sizes, better than the
instruction-finetuning introduced in prior work (Furuta et al., 2023). Furthermore, we finetune
HTML-T5-XL with 347K demonstrations from Furuta et al. (2023), which performs better than the
11B-parameter Flan-T5-XXL even with 3B parameters, achieving 85.6% success. These results indicate
that we successfully incorporate domain knowledge on HTML comprehension for web automation into
pre-trained language models.
¹ We finetune LongT5 models with the Flan dataset released by Chung et al. (2022). See Appendix I.
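To make the output format concrete, the sketch below parses an action string of the form `click(ref = X)` into a name and its arguments. Only `click(ref = X)` is mentioned in the text, so the general `name(key = value, ...)` grammar assumed here, the function name `parse_action`, and the example values are illustrative assumptions rather than the simulator's actual specification.

```python
import re

# Assumed shape: an identifier followed by parenthesised "key = value" pairs.
ACTION_PATTERN = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_action(text: str):
    """Split an action string into (name, {argument: value}).

    Values are kept as strings and are split on commas, so a value that
    itself contains a comma is not supported by this simple sketch.
    """
    match = ACTION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a well-formed action: {text!r}")
    name, arg_text = match.groups()
    args = {}
    for part in (p.strip() for p in arg_text.split(",")):
        if not part:
            continue
        key, separator, value = part.partition("=")
        if not separator:
            raise ValueError(f"argument without '=': {part!r}")
        args[key.strip()] = value.strip()
    return name, args


print(parse_action("click(ref = 5)"))
```

A model output that fails to parse can then be counted as an unsuccessful step instead of being sent to the environment.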
Architecture and Objective We hypothesize that local and global attention mechanisms can capture
the hierarchical structures of HTML documents better than dense attention. We compare the web
automation performance on 56 MiniWoB++ tasks (Gur et al., 2022) by finetuning HTML-T5
with the public 12K-episode dataset (Liu et al., 2018). We adopt 2048 and 4096 tokens as input
lengths and prepare Base-size architectures. Table 2 (left) reveals that the combination of local
and global attention achieves a superior success rate, by over 18%, compared to the
instruction-finetuned dense attention (Chung et al., 2022; Raffel et al., 2020) and local attention
only. Surprisingly, local attention only still surpasses dense attention by about 9%, which
suggests that local relations between elements and attributes in HTML are essential for web tasks.
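The attention patterns being compared can be sketched by counting which keys each query position may attend to. The sketch assumes a LongT5-style scheme in which every token sees a fixed local window plus one summary token per block of the input; the window radius, block size, and function names are illustrative choices, not values taken from this section.

```python
import math


def local_window(i, seq_len, radius):
    """Key positions that a query at position i may attend to locally."""
    return range(max(0, i - radius), min(seq_len, i + radius + 1))


def attention_pairs(seq_len, radius, block_size):
    """Count query-key pairs under dense, local, and local-plus-global attention."""
    # One global summary token per block of `block_size` input tokens.
    num_blocks = math.ceil(seq_len / block_size)
    local = sum(len(local_window(i, seq_len, radius)) for i in range(seq_len))
    return {
        "dense": seq_len * seq_len,
        "local": local,
        "local_plus_global": local + seq_len * num_blocks,
    }


# Illustrative configuration for a 4096-token HTML input.
counts = attention_pairs(seq_len=4096, radius=127, block_size=16)
print(counts)
```

With these illustrative settings the local-plus-global pattern touches far fewer query-key pairs than dense attention while still giving every token a coarse view of the whole document, which is the property the hypothesis above relies on.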
As for the pre-training objective in Table 2 (right), HTML-denoising generally improves the
performance on offline task planning on the real-estate website and MiniWoB. In particular, using
only longer span lengths (µ ∈ {8, 64}) outperforms other choices, including the popular
configuration in the natural language domain (µ ∈ {3, 8, 64} + Prefix LM objective). Longer spans
can reduce the less meaningful predictions that arise from shorter spans (e.g. µ = 3) and inject
the structural bias of HTML into language models better. See Appendix H.2 for further results with
model scaling.
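A span-denoising example with a configurable mean span length µ can be sketched as below. This is a deliberately simplified variant, not the pre-training pipeline used for HTML-T5: every span has the same length (so lengths only approximate µ), spans are placed one per equal-sized segment, the 15% noise density is an assumed value, and the `<extra_id_k>` sentinel naming follows the common T5 convention.

```python
import random


def corrupt_spans(tokens, mean_span_length, noise_density=0.15, seed=0):
    """Mask non-overlapping spans and return (inputs, targets) in T5 style."""
    rng = random.Random(seed)
    num_noise = max(1, round(len(tokens) * noise_density))
    num_spans = max(1, round(num_noise / mean_span_length))
    span_len = max(1, num_noise // num_spans)
    segment = len(tokens) // num_spans  # one span is placed inside each segment

    inputs, targets = [], []
    cursor = 0
    for k in range(num_spans):
        start = k * segment + rng.randint(0, max(0, segment - span_len))
        end = min(len(tokens), start + span_len)
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # replace the span by a sentinel
        targets.append(sentinel)             # the target restores the span
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


# Hypothetical tokenized HTML: short spans often cover a single tag or attribute,
# while longer spans tend to cover whole elements.
html_tokens = '<div id="form"> <label for="city"> City </label> <input id="city" type="text"> </div>'.split() * 20
for mu in (3, 8, 64):
    masked_inputs, masked_targets = corrupt_spans(html_tokens, mean_span_length=mu)
    num_spans = sum(t.startswith("<extra_id_") for t in masked_targets)
    print(f"mu={mu:2d}  spans={num_spans}  input_length={len(masked_inputs)}")
```

For a fixed noise budget, a larger µ yields fewer but longer masked spans, so the model has to reconstruct larger pieces of markup from their surroundings.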
5 DISCUSSION AND LIMITATION
Modular Approach with Specialist Language Models We demonstrate it is beneficial to divide
web automation into planning, HTML summarization, and code generation, and to combine
domain-expert language models aligned with self-experience data. Such modular approaches have also
been adopted to support the inference of LLMs (Xu et al., 2023), multimodal tasks (Zeng et al.,
2022), and robotics (Ahn et al., 2022), although they might incur additional computational cost
and latency.
Broad Generalization across the Internet Because open-loop planning with prompted Flan-U-PaLM
achieves at most 10-30% success, we have demonstrated that self-experience supervision on
real websites is essential for planning modules. As we demonstrated with Mind2Web, our method could
generalize across the internet if we have enough data. Future work is expected to collect
demonstrations at scale and align larger domain-expert models with them.
Feedback for Program Synthesis We leverage Flan-U-PaLM with 540B parameters as a capable
program synthesis module via few-shot prompting. Such a large model, however, makes it challenging
to reflect feedback about errors in the generated code, compared to smaller models. We leave it as
a future direction to incorporate feedback for program synthesis into larger language models.
Evaluation for Real-world Web Automation Beyond the simulated web environments (Shi et al.,
2017; Yao et al., 2022a), we have shown that WebAgent can follow complex and sometimes
ambiguous instructions on real estate, social media, and map websites. On the other hand, it is
costly to evaluate the performance of autonomous agents in the real world. Automated evaluation
with minimal human intervention would be helpful for the scalable development of real-world web
agents.
6 CONCLUSION
We build a system for real-world web automation, combining HTML-T5 for planning and HTML
summarization and Flan-U-PaLM for grounded program synthesis. Our proposed WebAgent achieves
around 70-80% success on real websites via self-experience supervision, outperforming the
single-LLM approach by over 50%, which suggests that dividing the task into a sequence of
sub-problems handled by multiple language models can increase the overall task success. We also
propose a scalable recipe for HTML-specialized language models, in which we train local and global
attention mechanisms with a mixture of long-span denoising objectives to capture the hierarchical
structures of HTML documents. HTML-T5 not only plays an essential role in WebAgent but also
achieves the best results on a variety of HTML-based benchmarks such as Mind2Web and MiniWoB++.
We hope our work brings us one step closer to the practical deployment of autonomous web agent
systems.
ETHICS STATEMENT
This paper presents encouraging evidence of autonomous agents’ potential for deployment on real
websites, extending beyond simulated environments. In the foreseeable future, this technology could
lead to the development of sophisticated AI assistant tools for computers and smartphones, enhancing
productivity and accessibility for society.
While we recognize the promising aspects of autonomous agents, we must also consider the potential
for misuse and unintended consequences in their development. As our proposed system is based
on LLMs, there is a risk of prompt injection. The improper use of web automation could pose
cybersecurity threats and expose users to scams. To mitigate these risks, it is crucial for researchers,
policymakers, and industry stakeholders to collaborate on establishing guidelines and regulations for
the development of autonomous agents. Additionally, security research focused on LLM agents will
become an essential domain for society.
ACKNOWLEDGMENTS
We thank Heiga Zen, Yingjie Miao, Yusuke Iwasawa, Joshua Ainslie, Santiago Ontanon, Quoc V. Le,
Zoubin Ghahramani, Jeff Dean, and Tris Warkentin for their support and advice on this work. HF was
supported by JSPS KAKENHI Grant Number JP22J21582.
REFERENCES
Leonard Adolphs, Benjamin Boerschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano
Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa,
and Lierni Sestorain Saralegui. Boosting search engines with interactive agents. In Transactions
on Machine Learning Research, 2022.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea
Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine
Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally
Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee,
Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka
Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander
Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy
Zeng. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint
arXiv:2204.01691, 2022.
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. Etc: Encoding long and structured
inputs in transformers. arXiv preprint arXiv:2004.08483, 2020.
Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy,
David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, Yun-Hsuan Sung, and Sumit Sanghai. Colt5:
Faster long-range transformers with conditional computation. arXiv preprint arXiv:2303.09752,
2023.
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos,
Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark,
Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark
Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang,
Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury,
Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A.
Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa
Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad
Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari,
Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz,
Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun,
Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang
Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni,
Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John
Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov,
Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy,
Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So,
Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang,
Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting
Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny
Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report. arXiv preprint arXiv:2305.10403,
2023.
Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. Docformer:
End-to-end transformer for document understanding. In International Conference on Computer
Vision, 2021.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large
language models. arXiv preprint arXiv:2108.07732, 2021.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri,
Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan,
Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian,
Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios
Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino,
Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders,
Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa,
Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob
McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating
large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a.
Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, and Kai Yu.
WebSRC: A dataset for web-based structural reading comprehension. In Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing, pp. 4173–4185, 2021b.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh,
Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam
Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James
Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin
Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph,
Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M.
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon
Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean,
Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. arXiv preprint
arXiv:2204.02311, 2022.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai,
Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu,
Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob
Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned
language models. arXiv preprint arXiv:2210.11416, 2022.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
2021.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang,
and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long
documents. arXiv preprint arXiv:1804.05685, 2018.
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and
Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070,
2023.
Oscar Diaz, Itziar Otaduy, and Gorka Puente. User-driven automation of web form filling. In
International Conference on Web Engineering, 2013.
Alexander R. Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. Multi-news: a
large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint
arXiv:1906.01749, 2019.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing
Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and
natural languages. arXiv preprint arXiv:2002.08155, 2020.
Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin
Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint
arXiv:2305.11854, 2023.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and
Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2023.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle
use a laptop? a question answering benchmark with implicit reasoning strategies. arXiv preprint
arXiv:2101.02235, 2021.
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and
Yinfei Yang. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the
Association for Computational Linguistics: NAACL 2022, pp. 724–736, 2022.
Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. Learning to navigate the
web. In International Conference on Learning Representations, 2019.
Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery,
Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language
models. arXiv preprint arXiv:2210.03945, 2022.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert
with disentangled attention. In International Conference on Learning Representations, 2021.
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin
Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge
competence with apps. arXiv preprint arXiv:2105.09938, 2021a.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. In International Conference on
Learning Representations, 2021b.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot
planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207,
2022.
Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair
Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, and Timothy Lillicrap. A
data-driven approach for learning to control computers. In International Conference on Machine
Learning, 2022.
Sheng Jia, Jamie Ryan Kiros, and Jimmy Ba. DOM-q-NET: Grounded RL on structured language.
In International Conference on Learning Representations, 2019.
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks.
arXiv preprint arXiv:2303.17491, 2023.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. In Advances In Neural Information Processing Systems,
2022.
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 3045–3059, November 2021.
Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. Structurallm:
Structural pre-training for form understanding. arXiv preprint arXiv:2105.11210, 2021a.
Junlong Li, Yiheng Xu, Lei Cui, and Furu Wei. Markuplm: Pre-training of text and markup language
for visually-rich document understanding. arXiv preprint arXiv:2110.08518, 2021b.
Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha,
and Hongfu Liu. Selfdoc: Self-supervised document representation learning. In Conference on
Computer Vision and Pattern Recognition, 2021c.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. In Annual Conference of the Association for Computational
Linguistics, 2020.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Remi Leblond, Tom
Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien
de, Masson dAutume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven
Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson,
Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level
code generation with alphacode, 2022.
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and
Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint
arXiv:2209.07753, 2023.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization
Branches Out, pp. 74–81. Association for Computational Linguistics, July 2004.
Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone.
Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint
arXiv:2304.11477, 2023.
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on
web interfaces using workflow-guided exploration. In International Conference on Learning
Representations, 2018.
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V.
Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods
for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin
Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou,
Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu
Fu, and Shujie Liu. Codexglue: A machine learning benchmark dataset for code understanding
and generation. arXiv preprint arXiv:2102.04664, 2021.
Sahisnu Mazumder and Oriana Riva. Flin: A flexible natural language interface for web navigation.
arXiv preprint arXiv:2010.12844, 2020.
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing
english math word problem solvers. arXiv preprint arXiv:2106.15772, 2021.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. Summarunner: A recurrent neural network based
sequence model for extractive summarization of documents. arXiv preprint arXiv:1611.04230,
2016.
Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I Wang, and Xi Victoria
Lin. Lever: Learning to verify language-to-code generation with execution. In International
Conference on Machine Learning, 2023.
Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer
Singh, and Roy Fox. Do embodied agents dream of pixelated sheep: Embodied decision making
using language guided world modelling. In International Conference on Machine Learning, 2023.
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton,
Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and
Ryan Lowe. Training language models to follow instructions with human feedback. arXiv preprint
arXiv:2203.02155, 2022.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association
for Computational Linguistics.
Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint
arXiv:2205.12255, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math
word problems? arXiv preprint arXiv:2103.07191, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel
Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor
Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini
Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis
Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan
Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma,
Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan
Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling up models and
data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer,
Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to
use tools. arXiv preprint arXiv:2302.04761, 2023.
Eva Sharma, Chen Li, and Lu Wang. Bigpatent: A large-scale dataset for abstractive and coherent
summarization. arXiv preprint arXiv:1906.03741, 2019.
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi
Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to ui actions: Learning to follow
instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245, 2023.
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An
open-domain platform for web-based agents. In International Conference on Machine Learning,
2017.
Maayan Shvo, Zhiming Hu, Rodrigo Toro Icarte, Iqbal Mohomed, Allan D. Jepson, and Sheila A.
McIlraith. Appbuddy: Learning to accomplish tasks in mobile apps via reinforcement learning. In
Canadian Conference on Artificial Intelligence, 2021.
Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B. Tenenbaum, Leslie Pack Kaelbling, and Michael
Katz. Generalized planning in pddl domains with pretrained large language models. arXiv preprint
arXiv:2305.11014, 2023.
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter
Fox, Jesse Thomason, and Animesh Garg. ProgPrompt: Generating situated robot task plans using
large language models. arXiv preprint arXiv:2209.11302, 2022.
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive
planning from feedback with language models. arXiv preprint arXiv:2305.16653, 2023.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging
big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,
2022.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question
answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2019.
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won
Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald
Metzler. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven
Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin,
James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent
Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh
Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi,
Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran,
Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora
Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron
Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi,
and Quoc Le. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239,
2022.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand
Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language
models. arXiv preprint arXiv:2302.13971, 2023.
Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed,
Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: A reinforcement learning platform
for android. arXiv preprint arXiv:2105.13231, 2021.
Dweep Trivedi, Jesse Zhang, Shao-Hua Sun, and Joseph J. Lim. Learning to synthesize programs as
interpretable and generalizable policies. arXiv preprint arXiv:2108.13643, 2022.
Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language
models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv
preprint arXiv:2206.10498, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, 2017.
15 | https://arxiv.org/pdf/2307.12856.pdf |
15 | Published as a conference paper at ICLR 2024
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and
Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv
preprint arXiv:2305.16291, 2023a.
Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, and Dongfang Liu. Webformer:
The web-page transformer for structure information extraction. arXiv preprint arXiv:2202.00217,
2022a.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and
Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions.
arXiv preprint arXiv:2212.10560, 2022b.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified
pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, 2021.
Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and
select: Interactive planning with large language models enables open-world multi-task agents. In
International Conference on Machine Learning, 2023b.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le,
and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv
preprint arXiv:2201.11903, 2022.
Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. Small
models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848, 2023.
Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. LayoutLM: Pre-training
of text and layout for document image understanding. arXiv preprint arXiv:1912.13318, 2019.
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable
real-world web interaction with grounded language agents. arXiv preprint arXiv:2207.01206,
2022a.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629,
2022b.
Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke,
and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language.
arXiv preprint arXiv:2204.00598, 2022.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. Pegasus: Pre-training with extracted
gap-sentences for abstractive summarization. In International Conference on Machine Learning,
2020.
Zihan Zhao, Lu Chen, Ruisheng Cao, Hongshen Xu, Xingyu Chen, and Kai Yu. TIE: Topological
information enhanced structural reading comprehension on web pages. In Proceedings of the
2022 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pp. 1808–1821, 2022.
Longtao Zheng, Rundong Wang, and Bo An. Synapse: Leveraging few-shot exemplars for human-level
computer control. arXiv preprint arXiv:2306.07863, 2023.
Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. Mediasum: A large-scale media interview
dataset for dialogue summarization. arXiv preprint arXiv:2103.06410, 2021.
16 | https://arxiv.org/pdf/2307.12856.pdf |
16 | Published as a conference paper at ICLR 2024
APPENDIX
A NOTE FOR REAL-WORLD EVALUATION
The development of autonomous agents should take security and safety into account. In the real-website
evaluation, we conducted all experiments under human supervision in case undesired behaviors
occurred. We use Selenium WebDriver (footnote 2), a popular library for browser automation,
and limit the number of accesses per second so as not to stress the server. We have anonymized the
real websites we tested on for safety and privacy reasons.
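The access limit mentioned above can be sketched as a small throttle placed in front of each browser action. This is a minimal Python sketch, not the authors' code: the class name `RateLimiter`, the interval value, and the injectable `clock` and `sleep` parameters are assumptions introduced for illustration.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive actions (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_s
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous action, None before the first one

    def wait(self):
        """Block until at least `min_interval_s` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a Selenium script, one would call `limiter.wait()` before each page load or element interaction so that the target server never sees more than one action per interval.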
B EXTENDED RELATED WORKS
Document Understanding Understanding structured documents has been a practical challenge for
transformer-based language models. Prior works employ layout-informed tokens (Xu et al., 2019) or
even multimodal tokens from visual inputs (Appalaraju et al., 2021; Li et al., 2021a;c). In particular, for
documents written in markup languages, text-XPath alignment (Li et al., 2021b), token separation
between text and HTML (Wang et al., 2022a), and extra topological information of HTML (Zhao et al.,
2022) have been proposed to better leverage their syntax. On the other hand, such domain-specific
designs conflict with the recent generalist and scaling trends around LLMs (Anil et al., 2023; OpenAI, 2023).
Because web agents require instruction-conditioned HTML understanding, it is desirable to
reconcile specialist aspects for HTML documents with generalist capabilities for natural
language tasks. In this work, we design HTML-T5 to incorporate the structural bias of HTML by
combining local-global attention in the encoder with a mixture of long-span denoising objectives, while
also improving instruction following in downstream web-based tasks.
LLM for Task Planning The commonsense prior knowledge in LLMs has allowed us to leverage
them for a variety of task planning problems. For instance, Huang et al. (2022) propose an LLM agent that
generates natural language plans in an open-loop manner. Nottingham et al. (2023) and Wang et al.
(2023b) perform sequential closed-loop planning in Minecraft. Singh et al. (2022) decode robotic
plans as Pythonic text, and several works incorporate planning definition and domain languages
into the outputs (Liu et al., 2023; Silver et al., 2023; Valmeekam et al., 2023). In contrast,
our WebAgent leverages finetuned specialist language models and performs closed-loop planning
coupled with HTML summarization by decomposing given instructions. We empirically show that
our system is superior to open-loop planning with a single prompted generalist LLM.
Footnote 2: https://www.selenium.dev/
17 | https://arxiv.org/pdf/2307.12856.pdf |
17 | Published as a conference paper at ICLR 2024
C IMPLEMENTATION DETAILS OF HTML-T5
We use the implementation of local and global attention released by Guo et al. (2022) (footnote 3). Following
Guo et al. (2022), we set the local radius to r = 127 and the block size for transient global attention to
k = 16. For the pre-training objective, similar to Tay et al. (2022), we construct mixtures of denoising
objectives with long mean span lengths, µ ∈ {8, 64}, and set the denoising ratio (the percentage of masked
tokens in the input sequence) to 0.15 for all of them. We adopt an input sequence length of 4096 and an
output sequence length of 910 during pre-training. The batch size for training is set to 128. We train the models
for 100K iterations, following other pre-training strategies for the T5 family (Chung et al., 2022; Lester
et al., 2021). We leverage the SeqIO (Roberts et al., 2022) and T5X (Roberts et al., 2022) libraries to
manage the training pipeline. We also use SentencePiece (Kudo & Richardson, 2018) with a 32K-token
vocabulary from the C4 dataset (Raffel et al., 2020) as the tokenizer. During downstream finetuning, we
adopt a 16K-token context window unless otherwise mentioned. We used cloud TPU-v3,
which has 32 GiB of HBM memory, with 128 cores for the experiments.
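To make the denoising setup concrete, the following is a minimal Python sketch of span corruption with a mixture of mean span lengths. It is not the T5X/SeqIO implementation: the function names, the sentinel format, and the sampling scheme (exponentially distributed span lengths, one mean length drawn per example) are illustrative assumptions. Only µ ∈ {8, 64} and the 0.15 masking ratio come from the text.

```python
import random

MEAN_SPAN_LENGTHS = (8, 64)   # mu values stated in the text
NOISE_DENSITY = 0.15          # fraction of input tokens to mask


def corrupt_spans(tokens, mean_span_length, noise_density=NOISE_DENSITY, rng=None):
    """Mask up to `noise_density` of `tokens` in contiguous spans.

    Returns (inputs, targets): each masked span is replaced by one sentinel
    in `inputs` and written out after the same sentinel in `targets`.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random(0)
    n = len(tokens)
    budget = max(1, round(n * noise_density))   # total number of tokens to mask
    masked = [False] * n
    masked_count, attempts = 0, 0
    while masked_count < budget and attempts < 10 * n:
        attempts += 1
        # Assumed sampling scheme: span length ~ Exponential(mean = mean_span_length).
        length = max(1, int(rng.expovariate(1.0 / mean_span_length)))
        length = min(length, budget - masked_count)
        start = rng.randrange(0, n - length + 1)
        if any(masked[start:start + length]):
            continue                            # skip spans overlapping earlier ones
        for i in range(start, start + length):
            masked[i] = True
        masked_count += length

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < n:
        if masked[i]:
            tag = f"<extra_id_{sentinel}>"
            sentinel += 1
            inputs.append(tag)
            targets.append(tag)
            while i < n and masked[i]:          # adjacent masked tokens share a sentinel
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def corrupt_with_mixture(tokens, rng=None):
    """Draw one mean span length from the mixture, then corrupt the example."""
    rng = rng or random.Random(0)
    return corrupt_spans(tokens, rng.choice(MEAN_SPAN_LENGTHS), rng=rng)
```

With a long mean span length, a single masked region can cover a whole run of nested tags, which is the intuition behind preferring longer spans for HTML than for natural language text.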
For the dataset, we prepare 100 WARC files (April 2019) from CommonCrawl (footnote 4) and pre-process the
raw HTML by removing non-Unicode and alphanumeric documents and extracting subtrees around
<label> elements that have a for attribute, to reduce the noise in the training corpus. This results in
about 3.41M examples (Table 5).
# of examples    # of tokens
                 Average   90th   Max
3.41M            1020      4566   7627
Table 5: Statistics of the CommonCrawl HTML corpus for self-supervised denoising pre-training of HTML-T5.
Input lengths are measured in tokens with SentencePiece (Kudo & Richardson, 2018).
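The subtree extraction around <label> elements can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes well-formed (XML-parsable) markup so that the standard library parser suffices, and the function name `extract_label_subtrees` and the `levels_up` parameter are hypothetical, since the text does not say how far above each label the subtree is rooted.

```python
import xml.etree.ElementTree as ET


def extract_label_subtrees(markup, levels_up=1):
    """Return serialized subtrees rooted `levels_up` ancestors above each
    <label for="..."> element. Assumes `markup` is well-formed."""
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    subtrees, seen = [], set()
    for label in root.iter("label"):
        if "for" not in label.attrib:
            continue                              # keep only labels with a `for` attribute
        node = label
        for _ in range(levels_up):
            node = parent_of.get(node, node)      # stop climbing at the root
        if id(node) not in seen:                  # labels sharing an ancestor yield one subtree
            seen.add(id(node))
            subtrees.append(ET.tostring(node, encoding="unicode"))
    return subtrees
```

Real CommonCrawl pages are rarely well-formed, so a practical version would need a tolerant HTML parser in place of `xml.etree.ElementTree`.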
D WEBAGENT EXAMPLE FLOW IN REAL-ESTATE WEBSITE
Previous Planning Steps
Go to realestatewebsite.com
Previous Snippet IDs
-
Instruction
Can you find me a 1 bedroom apartment in San Diego that has a fitness center?
HTML Document
<html data-ref="0" id="realEstateApp"><body data-ref="11"><div
data-ref="12"><header data-ref="13" id="topHeader"><div data-ref="14"><div
data-ref="15" id="menu"><button data-ref="16" id="headerMenuItem"
type="button"><i data-ref="17" id="topHeaderIcon"></i><label data-ref="18"
id="labelForMenu">Menu</span …
Sub-Instruction
Type in San Diego into search.
Snippet References
data-ref=129, data-ref=156
Program
# Type in san diego, ca into search
driver.find_element(By.CSS_SELECTOR,
'#quickSearchLookup[is-snippet="true"][webcoder-visibility="100"]').clear()
driver.find_element(By.CSS_SELECTOR,
'#quickSearchLookup[is-snippet="true"][webcoder-visibility="100"]').send_keys("san diego, ca")
Inputs: history of previous predictions; language instruction and HTML page.
Outputs: planning and snippets; Python program in Selenium.
Figure 6: An example flow with planning, summarization, and grounded program synthesis in the real estate
website. HTML-T5 iteratively predicts a decomposed sub-instruction and task-relevant snippet (orange) in a
closed-loop manner, conditioning on the HTML documents, instruction (yellow), and history of past predictions
(green). Flan-U-PaLM is prompted with sub-instruction and snippet (orange) to decode python programs (blue).
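The flow in Figure 6 can be summarized as a closed loop. The sketch below is a hedged illustration with stand-in callables, not the released system: `plan_and_summarize` plays the role of HTML-T5, `synthesize_program` the role of Flan-U-PaLM, and `get_html`, `extract_snippet`, `execute`, the `None` stop signal, and `max_steps` are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class History:
    """Past predictions that are fed back into the planner at every step."""
    sub_instructions: List[str] = field(default_factory=list)
    snippet_refs: List[List[int]] = field(default_factory=list)


def run_episode(instruction, get_html, plan_and_summarize, extract_snippet,
                synthesize_program, execute, max_steps=10):
    """Plan, summarize, and act until the planner signals completion."""
    history = History()
    programs = []
    for _ in range(max_steps):
        html = get_html()                                   # current page observation
        sub_instruction, refs = plan_and_summarize(instruction, html, history)
        if sub_instruction is None:                         # assumed end-of-episode signal
            break
        snippet = extract_snippet(html, refs)               # task-relevant HTML only
        program = synthesize_program(sub_instruction, snippet)
        execute(program)                                    # acts on the website
        history.sub_instructions.append(sub_instruction)
        history.snippet_refs.append(refs)
        programs.append(program)
    return programs
```

Keeping the planner's inputs (instruction, HTML, history) separate from the code generator's inputs (sub-instruction, snippet) mirrors the division of labor described in the caption: only the short snippet, never the full page, reaches the program-synthesis model.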
Footnote 3: https://github.com/google-research/longt5
Footnote 4: https://commoncrawl.org/
18 | https://arxiv.org/pdf/2307.12856.pdf |
18 | Published as a conference paper at ICLR 2024
E WEBSRC: STATIC HTML COMPREHENSION
To emphasize the advantage of our modular approach, we test WebAgent on a static website
comprehension benchmark, WebSRC (Chen et al., 2021b), which is a contextual QA dataset with HTML
documents. The questions require an understanding of the spatial and logical structure of websites,
and the answers are either text spans in the HTML or yes/no. For comprehensive evaluation, WebSRC
has three different types of websites: KV, Comparison, and Table. The KV task is value extraction
given an attribute key. The Comparison task has several entities with the same attributes. The Table task
requires a structural understanding of header columns and the values in each row. We finetune HTML-T5
for snippet extraction to predict the data-ref corresponding to the answer, and use the dev set for
evaluation.
As in real-world web automation, HTML-T5 first predicts the data-ref attribute of the task-relevant
snippet from the input HTML document. To make sure there is enough context, we extract the
snippet spanning from the predicted element up to its ancestor two levels above, via XPath. If it exceeds
the context length of Flan-U-PaLM, we limit it to the parent element. If it still does not fit, we truncate
the end of the extracted snippet to fit within the token budget. Because snippet extraction in table
structures often loses the context needed for question answering, we simply truncate the HTML documents
for Table tasks. Flan-U-PaLM predicts the answers given 5-shot examples.
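The fallback order described above (two levels up, then the parent, then truncation) can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes well-formed markup parsed with the standard library rather than XPath on a live page, uses whitespace splitting as a stand-in for the real tokenizer, and assumes that the final truncation is applied to the parent-level snippet. The names `extract_snippet` and `count_tokens` are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_tokens(text):
    """Stand-in tokenizer (assumption): count whitespace-separated pieces."""
    return len(text.split())


def extract_snippet(markup, data_ref, token_budget):
    """Return the widest context around the element with `data-ref == data_ref`
    that fits in `token_budget`, or None if the element is not found."""
    root = ET.fromstring(markup)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    target = next((el for el in root.iter() if el.get("data-ref") == data_ref), None)
    if target is None:
        return None
    parent = parent_of.get(target, target)          # falls back to itself at the root
    grandparent = parent_of.get(parent, parent)
    for node in (grandparent, parent):              # two levels up first, then one
        snippet = ET.tostring(node, encoding="unicode")
        if count_tokens(snippet) <= token_budget:
            return snippet
    # Last resort: cut the end of the parent-level snippet to the budget.
    return " ".join(snippet.split()[:token_budget])
```

Trying the widest context first keeps sibling elements (for example, the label next to an input field) in the prompt whenever the budget allows.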
As shown in Table 6, a single LLM, such as Flan-U-PaLM or HTML-T5, struggles with the limited
context length or model capacity. In contrast, WebAgent, our LLM-collaborative approach, improves
on both the single generalist and the single specialist LLM, and shows competitive results
with strong baselines. This demonstrates that modular LLMs complement each other.
Figure 7 presents the performance comparison on different types of websites (KV, Comparison, Table)
among MarkupLM (Li et al., 2021b), TIE (Zhao et al., 2022), and WebAgent. WebAgent is better at
Comparison tasks but inferior at structural understanding for KV and Table tasks compared to the other
baselines, which suggests that generalist LLMs are still not well suited to recognizing structured data
such as tables.
Models EM F1
T-PLM (Chen et al., 2021b) 61.67 69.85
H-PLM (Chen et al., 2021b) 70.12 74.14
V-PLM (Chen et al., 2021b) 73.22 76.16
MarkupLM-Large (Li et al., 2021b) 74.43 80.54
TIE-Large (Zhao et al., 2022) 81.66 86.24
Flan-U-PaLM 40.01 47.56
HTML-T5-Large 73.09 76.66
HTML-T5-XL 74.73 78.73
WebAgent 75.50 85.75
WebAgent (oracle) 76.91 86.64
Table 6: Evaluation on WebSRC (Chen et al., 2021b) with dev set. WebAgent, our collaborative LLMs, improves
the performance over both the single generalist (Flan-U-PaLM) and the specialist LLMs (HTML-T5). WebAgent (oracle)
uses oracle snippets that are guaranteed to include the answers, instead of those predicted by finetuned HTML-T5.
[Figure 7: two bar-chart panels, EM and F1, over website types KV, Compare, and Table, with one bar
series each for MarkupLM, TIE, and WebAgent. Bar labels in extraction order, EM: 80.32, 72.28, 61.98,
79.63, 82.55, 82.28, 79.34, 84.75, 64.49; F1: 87.73, 83.24, 67.8, 86.29, 83.95, 85.28, 85.9, 90.82, 80.6.]
Figure 7: The performance comparison on different types of websites in WebSRC dev set.
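For reference, the EM and F1 columns in Table 6 can be read as answer-level exact match and token-overlap F1. The sketch below assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the exact normalization used by the WebSRC evaluation script is not given in this section, so these helper functions are illustrative only.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)   # both empty counts as a match
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a predicted span overlaps the gold answer, which is one way a system can score noticeably higher on F1 than on EM.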
19 | https://arxiv.org/pdf/2307.12856.pdf |
19 | Published as a conference paper at ICLR 2024
F LIST OF LANGUAGE INSTRUCTIONS FOR REAL-WORLD WEB AUTOMATION
real-estate
1. can you search for a studio bedroom, 1+ bathroom houses in escondido, ca for corporate housing and price
less than 12100 on real estate website.
2. can you find me a studio bedroom, 1+ bathroom townhomes in hollywood, ca and price less than 14600 on
real estate website.
3. can you search for a studio bedroom, 1+ bathroom condos in inglewood, ca for senior housing and price less
than 8700 on real estate website.
4. I would like to search for a studio bedroom, 1+ bathroom houses in compton, ca and price more than 1200
for corporate housing on real estate website.
5. can you search for a studio bedroom, 1+ bathroom apartments in oroville, ca for corporate housing on real
estate website.
6. find me a studio bedroom, 1+ bathroom houses in modesto, ca on real estate website.
7. can you search for a studio bedroom, 1+ bathroom condos in redwood city, ca for student and price more
than 1900 on real estate website.
8. find me a 1 bedroom condos in santa clara, ca and price between 1600 and 7400 on real estate website.
9. find me a 1 bedroom, 3+ bathroom apartments in martinez, ca with min price 1800 on real estate website.
10. can you find me a 2 bedroom, 2+ bathroom townhomes in concord, ca and price more than 600 on real estate
website.
11. can you find me a studio bedroom, 2+ bathroom apartments in san diego, ca and price less than 9300 on real
estate website.
12. find me a studio bedroom houses in novato, ca and price between 1500 and 6700 on real estate website.
13. can you find me a studio bedroom, any bathroom townhomes in petaluma, ca and price more than 1000 on
real estate website.
14. search for a 1 bedroom apartments in modesto, ca and price more than 1000 on real estate website.
15. find me a 1 bedroom, 2+ bathroom apartments in watts, ca for senior housing less than 6300 on real estate
website.
16. can you find me a 1 bedroom houses in victorville, ca that have dog friendly, furnished and price more than
700 on real estate website.
17. I need a 2 bedroom, any bathroom condos in inglewood, ca and price more than 1000 on real estate website.
18. find me a 2 bedroom, 2+ bathroom apartments in livermore, ca on real estate website.
19. can you find me a 2 bedroom apartments in santa clara, ca that has parking and price less than 10300 on real
estate website.
20. can you search for a 2 bedroom condos in oakland, ca on real estate website.
social-media
1. Show me the most hot thread in r/google at social media website.
2. Can you point out the most hot thread in r/learnpython at social media website.
3. Could you check the 1st hot thread in r/artificial at social media website.
4. Can I check the most hot thread in Taiwan on social media website.
5. Show me the first new thread in r/facebook at social media website.
6. Present the most new thread of r/Python filtered by Tutorial flair on social media website.
7. Could you check the 1st new thread in r/facebook at social media website.
8. I want to read the 1st hot thread from r/Python tagged as Daily Thread at social media website.
9. Present the most hot thread of r/google filtered by Info | Mod Post flair on social media website.
10. Show me the most new thread in r/learnmachinelearning filtered by Help flair at social media website.
11. Can you point out the first hot thread in r/deeplearning at social media website.
12. Could you check the 1st hot thread in r/machinelearningnews at social media website.
13. Present the most hot thread of r/artificial filtered by News flair on social media website.
14. Please find me the first hot thread in r/facebook at social media website.
15. Present the most new thread of r/machinelearningnews filtered by Startup News flair on social media website.
16. Show me the most hot thread in r/artificial filtered by AI Art flair at social media website.
17. Could you check the first new thread in r/facebook at social media website.
18. I want to read the most top thread from r/google tagged as Info | Mod Post at social media website.
19. Show me the most new thread in r/startups filtered by Share Your Startup flair at social media website.
20. Could you check the 2nd new thread in r/facebook at social media website.
20 | https://arxiv.org/pdf/2307.12856.pdf |
20 | Published as a conference paper at ICLR 2024
map
1. Show me the way from San Jose to Mountain View by 2nd Cycling at map website.
2. Please show me the way from The Painted Ladies to San Francisco Zoo with 3rd Best option at map website.
3. Could you tell me the path from California Academy of Sciences to de Young Museum by 1st Transit at map
website.
4. Could you tell me the way from Union Square to The Painted Ladies with 2nd Cycling option at map
website.
5. Please present the way from Chappell Hayes Observation Tower to San Jose with 2nd Walking option at
map website.
6. Please present the path from Jack London Square to Emeryville by 2nd Cycling at map website.
7. I’d like to move The Midway from Children’s Fairyland by 1st Cycling at map website.
8. I’d like to move Chase Center from San Francisco - Oakland Bay Bridge with 2nd Transit option at map
website.
9. I want to move Pier 39 from Berkeley by 3rd Cycling at map website.
10. I want to go to Emeryville from Mountain View with 2nd Cycling option at map website.
11. Can you point out the way from San Mateo to Stanford University by 2nd Cycling at map website.
12. Could you point out the way from Palace of Fine Arts to UC Berkeley by 1st Cycling at map website.
13. Point out the way from The Painted Ladies to San Francisco Museum of Modern Art by 2nd Driving at map
website.
14. Could you find the path from Union Square to Palo Alto by 1st Cycling at map website.
15. Please check the way from San Jose to San José Mineta International Airport with 1st Walking at map
website.
16. Check the path from San Francisco Zoo to Berkeley with 1st Cycling at map website.
17. I’d like to check Parking Lots along the way from Stanford University to The Painted Ladies with Best
option at map website.
18. Check Gas stations along the way from de Young Museum to Oakland with Driving option at map website.
19. Please show me Hotels along the way from Palace of Fine Arts to Berkeley by Transit at map website.
20. Check Gas stations along the way from Bay Area Discovery Museum to Santa Cruz with Best option at map
website.
G EXAMPLE EPISODE IN REAL-WORLD WEB AUTOMATION
21 | https://arxiv.org/pdf/2307.12856.pdf |
21 | Published as a conference paper at ICLR 2024
map: Show me the way from San Jose to Mountain View by 2nd Cycling at map website?
# Go to map website
driver.get("https://www.(map website).com/")
# Type Mountain View into search
driver.find_element(By.CSS_SELECTOR, "...").clear()
driver.find_element(By.CSS_SELECTOR, "...").send_keys("Mountain View")
# Type San Jose into starting point
driver.find_element(By.CSS_SELECTOR, "...").clear()
driver.find_element(By.CSS_SELECTOR, "...").send_keys("San Jose")
# Click Cycling radio button
driver.find_element(By.CSS_SELECTOR, "#Cycling").click()
# Click 2nd trip
driver.find_element(By.CSS_SELECTOR, "#trip1").click()
Figure 8: Example episodes of real-world web automation in map domain.
22 | https://arxiv.org/pdf/2307.12856.pdf |
22 | Published as a conference paper at ICLR 2024
H EXTENSIVE ABLATION OF HTML-T5
H.1 DATASET AND INITIALIZATION
To test our recipe described in Section 2.1, we compare different datasets and model initializations
for pre-training in terms of downstream task performance: offline task planning on real-estate and
average success rate on MiniWoB with the 12K dataset. We use Base-size models for the experiments.
For HTML-denoising, we prepare the corpus from CommonCrawl with (Extracted) or without (Raw)
subtree extraction around label elements in the documents. We also compare the initialization of the base
architectures before HTML-denoising: from scratch, or from models pre-trained with the PEGASUS
objective (Zhang et al., 2020), which is masked important-sentence prediction from long-context paragraphs.
Table 7 reveals that subtree extraction on the HTML corpus improves downstream performance, since
such pre-processing can reduce the noise in raw HTML. Moreover, initialization with PEGASUS
pre-trained weights is essential for HTML-T5 because of the long-context and instruction-following
nature of HTML-based tasks.
CC-HTML     PEGASUS   real-estate   MiniWoB++
Raw         ✓         80.56         56.7%
Extracted   ✗         67.11         49.1%
Extracted   ✓         82.46         57.0%
Table 7: Ablations of HTML-T5-Base on dataset quality and initialization. We evaluate offline task planning on
real-estate and average success rate on MiniWoB with 12K dataset. For HTML-denoising, we prepare
HTML corpus from CommonCrawl with (Extracted) or without (Raw) subtree extraction around label elements.
We also compare the pre-training of base architectures with PEGASUS objective (Zhang et al., 2020) before
HTML-denoising. The results imply that PEGASUS pre-training is critical for the architectures and preprocessing with subtree extraction improves the downstream performance on HTML-based tasks.
H.2 OFFLINE EVALUATION ON TASK PLANNING WITH MODEL SCALING
We compare the offline task planning performance between HTML-T5 and LongT5 (without
HTML-denoising) at different model sizes: Base (220M parameters), Large (770M parameters), and
XL (3B parameters). As described in Section 3.1, the models predict the next sub-instructions in
a closed-loop manner, taking the current HTML observations, user instructions, and previous
sub-instruction histories as inputs. For offline task planning evaluation, we use the demonstrations on the
real-estate website, preparing 130 demonstrations and splitting them into train (90%) and test
(10%) splits. We report the best per-step exact match accuracy on the test set.
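The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the shuffling seed, the whitespace stripping before comparison, and the function names are assumptions; only the 90%/10% split and the per-step exact match criterion come from the text.

```python
import random


def split_demonstrations(demos, train_fraction=0.9, seed=0):
    """Shuffle demonstrations and split them into train and test lists."""
    demos = list(demos)
    random.Random(seed).shuffle(demos)      # fixed seed is an assumption for reproducibility
    cut = round(len(demos) * train_fraction)
    return demos[:cut], demos[cut:]


def per_step_exact_match(predictions, references):
    """Percentage of steps whose predicted sub-instruction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be aligned step by step")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

With 130 demonstrations, this split yields 117 training and 13 test demonstrations; accuracy is then computed over every planning step in the test demonstrations rather than per episode.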
Table 8 shows that HTML-T5 outperforms LongT5 on the accuracy of sub-instruction prediction,
which demonstrates that HTML-denoising pre-training captures the structural bias of HTML better
without sacrificing the ability to understand natural language instructions. This also implies that our
proposed HTML-denoising can scale to larger-size models consistently.
Models real-estate Diff.
LongT5-Base 78.07 0.0
LongT5-Large 82.89 0.0
LongT5-XL 81.29 0.0
HTML-T5-Base 82.46 +4.39
HTML-T5-Large 83.63 +0.74
HTML-T5-XL 83.92 +2.63
Table 8: Accuracy of offline evaluation on task planning. We leverage the demonstrations in real-estate
websites. Compared to original LongT5, and as we scale model size, HTML-T5 improves the accuracy of
sub-instruction prediction.
23 | https://arxiv.org/pdf/2307.12856.pdf |
23 | Published as a conference paper at ICLR 2024
H.3 DESCRIPTION GENERATION
We also investigate the capability of HTML-T5 on static HTML comprehension tasks, in addition to
interactive decision making tasks. We use the Description Generation benchmark (Gur et al., 2022), where
the models generate the textual description of elements, typically used for accessibility purposes and
annotated with a special attribute in the HTML schema known as for. This evaluates the understanding
of the structure of HTML as it would appear to a user, despite not having access to the rendered website
directly.
We compare LaMDA (Thoppilan et al., 2022), T5, LongT5, and HTML-T5 with respect to accuracy,
BLEU (Papineni et al., 2002), and ROUGE-1 (Lin, 2004) scores. As shown in Table 9, the local and
global attention mechanisms shared by LongT5 and HTML-T5 almost solve the
benchmark, improving on the previous best performance by over 10%, with performance still improving
as model size increases. Compared to the effect of local-global attention, HTML-T5 improves only
marginally over LongT5, which emphasizes that local and global attention is critical for capturing
the hierarchical structure of HTML documents.
Dev Test
Models Accuracy BLEU ROUGE-1 Accuracy BLEU ROUGE-1
LaMDA-1B (Gur et al., 2022) 83.3 87.5 90.2 84.3 88.6 91.2
T5-Large (Gur et al., 2022) 83.2 90.2 90.5 84.3 91.7 91.5
T5-XL (Gur et al., 2022) 84.0 90.8 90.9 85.2 92.1 91.9
LongT5-Base 96.4 98.0 98.5 95.6 97.4 98.2
LongT5-Large 98.1 98.9 99.2 97.7 98.5 99.0
LongT5-XL 98.4 99.1 99.3 98.5 99.2 99.3
HTML-T5-Base 96.5 98.1 98.6 95.9 97.5 98.3
HTML-T5-Large 98.1 98.9 99.2 97.7 98.3 99.1
HTML-T5-XL 98.4 99.0 99.3 98.9 99.4 99.5
Table 9: Results of Description Generation benchmark (Gur et al., 2022). We compare LaMDA (Thoppilan
et al., 2022), T5, LongT5, and HTML-T5 with respect to accuracy, BLEU, and ROUGE-1 scores. The results
demonstrate that local and global attention mechanisms, shared modules between LongT5 and HTML-T5, could
almost completely solve the benchmark by improving the previous best performance by over 10%. HTML-T5
slightly outperforms LongT5.
24 | https://arxiv.org/pdf/2307.12856.pdf |
24 | Published as a conference paper at ICLR 2024
I FLAN-LONGT5
In the web automation literature (Furuta et al., 2023; Kim et al., 2023), instruction-finetuned LLMs
have had great success in HTML comprehension and improved task success. For comparison with
HTML-denoising, we prepare an instruction-finetuned LongT5 (i.e., Flan-LongT5) by leveraging the Flan
dataset released by Chung et al. (2022). We finetuned the pre-trained LongT5 for 100K iterations
and selected the best checkpoints.
As a sanity check of instruction-tuning, we evaluate Flan-LongT5 in few-shot/zero-shot settings
on CoT benchmarks (GSM8K (Cobbe et al., 2021), StrategyQA (Geva et al., 2021), SVAMP (Patel
et al., 2021), Asdiv (Miao et al., 2021), CommonsenseQA (Talmor et al., 2019)), BigBench-Hard
(BBH) (Suzgun et al., 2022), and MMLU (Hendrycks et al., 2021b), as tested in Longpre et al.
(2023). We reevaluate the performance of Flan-T5 using the official checkpoints (footnote 5). We also check
the performance of Flan-LongT5 on the downstream summarization tasks originally used to evaluate
LongT5 (Guo et al., 2022). We use the arXiv (Cohan et al., 2018), PubMed (Cohan et al., 2018),
BigPatent (Sharma et al., 2019), Multi-News (Fabbri et al., 2019), MediaSum (Zhu et al., 2021), and
CNN / Daily Mail (Nallapati et al., 2016) datasets for the evaluation, measuring the performance with
ROUGE-1/2/L metrics.
Table 10 shows that we have successfully replicated the LongT5 version of instruction-finetuned
language models. Flan-LongT5 achieves results competitive with the original Flan-T5; for instance,
Flan-LongT5-Large (36.64) outperforms Flan-T5-Large (35.25), but Flan-LongT5-XL (39.05) is still
behind Flan-T5-XL (43.04) on average. This might be caused by the training instability of XL-size
models (Guo et al., 2022). Because reasoning tasks, unlike the HTML-based tasks targeted by HTML-T5, do
not involve long contexts or hierarchical syntax, it is not surprising that Flan-LongT5 does not outperform
Flan-T5. Table 11 also demonstrates that we have successfully conducted instruction-tuning without
losing the capability of long text summarization.
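The averages quoted above can be re-derived from the per-benchmark scores in Table 10. The grouping used below (CoT and BBH-CoT columns under "CoT", MMLU and BBH columns under "Direct", and "Total" as the mean of the two) is inferred from the reported numbers rather than stated in the text, so this short Python check should be read as a consistency sketch under that assumption.

```python
# (zero-shot, few-shot) scores per benchmark, copied from Table 10.
TABLE_10 = {
    "Flan-T5-Large":     {"CoT": (35.14, 40.03), "MMLU": (40.68, 45.12),
                          "BBH": (25.90, 37.48), "BBH-CoT": (26.17, 31.45)},
    "Flan-T5-XL":        {"CoT": (51.74, 52.64), "MMLU": (50.76, 52.40),
                          "BBH": (26.09, 40.96), "BBH-CoT": (34.12, 35.62)},
    "Flan-LongT5-Large": {"CoT": (44.78, 45.34), "MMLU": (38.44, 40.03),
                          "BBH": (28.67, 34.67), "BBH-CoT": (29.38, 31.85)},
    "Flan-LongT5-XL":    {"CoT": (48.78, 50.02), "MMLU": (43.44, 44.74),
                          "BBH": (26.53, 37.77), "BBH-CoT": (29.09, 32.01)},
}


def averages(row):
    """Return (CoT average, Direct average, Total) under the assumed grouping."""
    cot = sum(row["CoT"] + row["BBH-CoT"]) / 4      # chain-of-thought settings
    direct = sum(row["MMLU"] + row["BBH"]) / 4      # direct-answer settings
    return cot, direct, (cot + direct) / 2


if __name__ == "__main__":
    for model, row in TABLE_10.items():
        cot, direct, total = averages(row)
        print(f"{model:18s} CoT={cot:.2f} Direct={direct:.2f} Total={total:.2f}")
```

Under this grouping the recomputed values agree with the reported Avg. columns to within rounding for all four models.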
Footnote 5: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
25 | https://arxiv.org/pdf/2307.12856.pdf |
25 | Published as a conference paper at ICLR 2024
                   CoT           MMLU          BBH           BBH-CoT       Avg.
Models             Zero   Few    Zero   Few    Zero   Few    Zero   Few    CoT    Direct Total
Flan-T5-Large      35.14  40.03  40.68  45.12  25.90  37.48  26.17  31.45  33.20  37.29  35.25
Flan-T5-XL         51.74  52.64  50.76  52.40  26.09  40.96  34.12  35.62  43.53  42.55  43.04
Flan-LongT5-Large  44.78  45.34  38.44  40.03  28.67  34.67  29.38  31.85  37.84  35.45  36.64
Flan-LongT5-XL     48.78  50.02  43.44  44.74  26.53  37.77  29.09  32.01  39.97  38.12  39.05
Table 10: Performance of Flan-LongT5 on reasoning tasks. We reevaluate the performance of Flan-T5 (Chung et al., 2022), using official checkpoints. Flan-LongT5 achieves competitive results to original Flan-T5.
                   arXiv                PubMed               BigPatent            MultiNews            MediaSum             CNN / Daily Mail
Models             R-1    R-2    R-L    R-1    R-2    R-L    R-1    R-2    R-L    R-1    R-2    R-L    R-1    R-2    R-L    R-1    R-2    R-L
LongT5-Large       48.28  21.63  44.11  49.98  24.69  46.46  70.38  56.81  62.73  47.18  18.44  24.18  35.54  19.04  32.20  42.49  20.51  40.18
LongT5-XL          48.35  21.92  44.27  50.23  24.76  46.67  76.87  66.06  70.76  48.17  19.43  24.94  36.15  19.66  32.80  43.94  21.40  41.28
Flan-LongT5-Large  48.52  22.00  44.46  50.46  25.08  46.96  70.53  57.13  63.02  47.76  18.99  24.52  35.71  19.18  32.33  43.13  20.89  37.28
Flan-LongT5-XL     48.37  21.75  44.22  50.23  24.75  46.73  76.31  65.17  70.01  48.19  19.47  24.80  36.16  19.75  32.81  43.46  21.00  37.34
Table 11: Performance of Flan-LongT5 on downstream summarization tasks, compared to LongT5 (Guo et al., 2022). We measure the performance with ROUGE-1/2/L metrics.
26 | https://arxiv.org/pdf/2307.12856.pdf |
26 | Published as a conference paper at ICLR 2024
J PER-TASK PERFORMANCE ON MINIWOB++
Task HTML-T5-XL (347K) HTML-T5-XL (12K) Flan-T5-XL (347K) WebN-T5-XL (12K)
book-flight 0.99 0.00 0.48 0.00
choose-date 0.16 0.03 0.08 0.00
choose-date-easy 1.00 0.28 1.00 0.03
choose-date-medium 0.56 0.14 0.57 0.00
choose-list 0.22 0.19 0.16 0.26
click-button 1.00 0.92 0.98 1.00
click-button-sequence 1.00 1.00 1.00 1.00
click-checkboxes 1.00 1.00 1.00 0.96
click-checkboxes-large 0.90 0.94 0.98 0.22
click-checkboxes-soft 0.99 0.64 1.00 0.54
click-checkboxes-transfer 1.00 1.00 0.99 0.63
click-collapsible 1.00 0.41 1.00 0.00
click-collapsible-2 0.93 0.26 0.94 0.00
click-color 1.00 1.00 0.27 0.27
click-dialog 1.00 1.00 1.00 1.00
click-dialog-2 0.74 0.31 0.34 0.24
click-link 0.99 1.00 1.00 1.00
click-menu 0.37 0.26 0.41 0.37
click-option 1.00 1.00 1.00 0.87
click-pie 0.96 0.89 0.99 0.51
click-scroll-list 0.99 0.91 0.00 0.00
click-shades 0.00 0.05 0.00 0.00
click-shape 0.79 0.57 0.58 0.53
click-tab 1.00 1.00 1.00 0.74
click-tab-2 0.94 0.40 0.94 0.18
click-tab-2-hard 0.88 0.30 0.57 0.12
click-test 1.00 1.00 1.00 1.00
click-test-2 1.00 1.00 1.00 1.00
click-widget 1.00 0.94 1.00 1.00
count-shape 0.67 0.55 0.64 0.41
email-inbox 1.00 0.99 0.99 0.38
email-inbox-forward-nl 1.00 0.92 1.00 0.60
email-inbox-forward-nl-turk 1.00 1.00 1.00 0.33
email-inbox-nl-turk 0.99 0.76 0.92 0.23
enter-date 1.00 0.00 1.00 0.00
enter-password 1.00 0.99 1.00 0.97
enter-text 1.00 0.96 1.00 0.89
enter-text-dynamic 1.00 1.00 1.00 0.98
enter-time 1.00 0.00 0.00 0.00
focus-text 1.00 1.00 1.00 1.00
focus-text-2 1.00 1.00 1.00 1.00
grid-coordinate 1.00 1.00 1.00 0.49
guess-number 0.13 0.00 0.10 0.00
identify-shape 1.00 0.89 0.90 0.88
login-user 1.00 0.80 1.00 0.82
login-user-popup 1.00 0.63 0.97 0.72
multi-layouts 1.00 1.00 1.00 0.83
multi-orderings 1.00 1.00 1.00 0.88
navigate-tree 0.99 0.99 1.00 0.91
search-engine 0.93 0.55 0.59 0.34
social-media 0.99 0.93 0.99 0.21
social-media-all 0.31 0.84 0.09 0.00
social-media-some 0.89 0.60 0.39 0.02
tic-tac-toe 0.57 0.46 0.42 0.48
use-autocomplete 0.97 0.23 0.98 0.22
use-spinner 0.07 0.07 0.03 0.07
Average 0.856 0.655 0.755 0.484
Table 12: Per-task average success rate on 56 tasks from MiniWoB++. We refer to Furuta et al. (2023) and Gur
et al. (2022) for the baseline performances.
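The rows of Table 12 can be aggregated per model with a few lines of code. The sketch below assumes that the Average row is an unweighted mean over tasks (our reading of the table, not something the caption states) and uses only four rows copied from the table, so its output is the mean over that subset and not the 56-task average. The helper names `parse_rows` and `column_means` are illustrative.

```python
from statistics import mean

# Column order as in the header of Table 12.
COLUMNS = [
    "HTML-T5-XL (347K)",
    "HTML-T5-XL (12K)",
    "Flan-T5-XL (347K)",
    "WebN-T5-XL (12K)",
]

# Four rows copied verbatim from Table 12 (a subset, for illustration only).
ROWS = """\
book-flight 0.99 0.00 0.48 0.00
click-button 1.00 0.92 0.98 1.00
enter-time 1.00 0.00 0.00 0.00
use-spinner 0.07 0.07 0.03 0.07
"""


def parse_rows(text):
    # Map each task name to its list of per-model success rates.
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        task, *values = line.split()
        table[task] = [float(v) for v in values]
    return table


def column_means(table):
    # Unweighted mean over tasks for each model column.
    return {
        name: mean(row[i] for row in table.values())
        for i, name in enumerate(COLUMNS)
    }


if __name__ == "__main__":
    for model, avg in column_means(parse_rows(ROWS)).items():
        print(f"{model}: {avg:.3f}")
```

Pasting all 56 rows into `ROWS` would let a reader check the Average row under the same assumption.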
27 | https://arxiv.org/pdf/2307.12856.pdf |
27 | K REAL-WORLD WEB AUTOMATION WITH DIFFERENT GENERALIST LLMS
We compare different generalist LLMs as a module of WebAgent: model-size variants (Flan-PaLM-8B, Flan-PaLM-62B, Flan-U-PaLM-540B) and a publicly accessible LLM
(gpt-3.5-turbo). We test these models on the map website with the same 20 instructions
as in Appendix F. The results in Figure 9 show that Flan-U-PaLM-540B and
gpt-3.5-turbo perform the same (80% success, 93.8% score), while Flan-PaLM-62B (60% success,
86.3% score) performs worse than Flan-U-PaLM-540B, which is caused by inaccurate program synthesis. In addition, Flan-PaLM-8B could not generate proper programs at all. We believe that any
LLM with sufficient program synthesis capabilities, not only Flan-U-PaLM-540B, could be integrated
into WebAgent.
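The claim that the generalist LLM is interchangeable can be illustrated with a small interface sketch. Everything named here (`ProgramSynthesizer`, `CallableSynthesizer`, `build_prompt`, `synthesize_program`, and the stand-in `fake_llm`) is hypothetical and is not WebAgent's actual code or prompt format. The sketch only shows one way a text-in, text-out model could be swapped behind a fixed interface that takes a sub-instruction and an HTML snippet and returns a program as text.

```python
from typing import Callable, Protocol


class ProgramSynthesizer(Protocol):
    """Hypothetical interface: any model that maps a prompt to program text."""

    def generate(self, prompt: str) -> str:
        ...


class CallableSynthesizer:
    """Adapts any text-in, text-out function (e.g. an API client) to the interface."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete

    def generate(self, prompt: str) -> str:
        return self._complete(prompt)


def build_prompt(sub_instruction: str, html_snippet: str) -> str:
    # Illustrative prompt layout; the real prompt format is an assumption here.
    return (
        f"# Sub-instruction: {sub_instruction}\n"
        f"# HTML snippet: {html_snippet}\n"
        "# Python program:\n"
    )


def synthesize_program(
    llm: ProgramSynthesizer, sub_instruction: str, html_snippet: str
) -> str:
    # The caller depends only on the interface, so the model can be swapped.
    return llm.generate(build_prompt(sub_instruction, html_snippet))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a fixed placeholder program.
    return "click(element_id='search-button')"


if __name__ == "__main__":
    llm = CallableSynthesizer(fake_llm)
    print(synthesize_program(llm, "click the search button", "<button id='search-button'>"))
```

Under this design, comparing Flan-PaLM variants with gpt-3.5-turbo amounts to passing a different callable to `CallableSynthesizer` while the rest of the pipeline stays unchanged.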
[Figure 9 image: two bar charts. Left panel: Performance (%), with Success and Score bars. Right panel: Error Analysis (%), with Program, Plan, and Sum bars. Legend: Flan-PaLM-8B, Flan-PaLM-62B, Flan-U-PaLM-540B, gpt-3.5-turbo. Left-panel values (Success / Score) in legend order: 0.0 / 0.0, 60.0 / 86.3, 80.0 / 93.8, 80.0 / 93.8.]
Figure 9: Performance (left) and error analysis (right) of real-world web automation with different generalist
LLMs. We compare model-size variants (Flan-PaLM-8B, Flan-PaLM-62B, Flan-U-PaLM-540B) and a public
LLM (gpt-3.5-turbo) on the map website.
28 | https://arxiv.org/pdf/2307.12856.pdf |