[Translation][Paper] P5 | Recommendation with Language Models: A Unified Pretrain, Personalized Prompt, and Predict Paradigm (2022)
This post presents P5 (the "Pretrain, Personalized Prompt, and Predict Paradigm"), a unified recommendation paradigm that reformulates recommendation tasks as natural language processing problems and uses a pretrained language model to transfer knowledge across tasks. P5 converts user-item interactions, item descriptions, and other data into text, and designs personalized prompts for multitask pretraining. Experiments show that P5 performs well on a variety of recommendation tasks and generalizes well in zero-shot settings.

Published at 2025-12-20 | Last Update 2025-12-20

Translator's Preface

This post is a translation of a paper from RecSys 2022: Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5).

Figure 1: P5 pretrains on an encoder–decoder Transformer model that takes in textual inputs and produces target responses.

Figure 3: Illustration of the P5 architecture.

Given the limits of the translator's expertise and maintenance effort, the translation may contain errors or become outdated; when in doubt, please refer to the original paper. Spread knowledge and respect the work: please credit the source when re-posting.

The translation follows.



For a long time, different recommendation tasks have typically required task-specific architectures and training objectives. This makes it hard to transfer learned knowledge and representations from one task to another, limiting the generalization ability of existing recommendation methods. For example, a sequential recommendation model can hardly be applied or transferred to a review generation task.

Considering that language can describe almost anything, and that a language foundation is a powerful medium for representing all kinds of problems and tasks, this paper proposes a flexible, unified text-to-text paradigm to address the above problem, which we call the "Pretrain, Personalized Prompt, and Predict Paradigm", abbreviated P5. It unifies various recommendation tasks in a shared framework:

  • In P5, all data (user-item interactions, user descriptions, item metadata, user reviews, etc.) are converted into a unified natural language sequence.
  • The rich information carried by natural language helps P5 capture deeper semantics and thus enables personalized recommendation.

Concretely, P5 learns different tasks with the same language modeling objective during pretraining, thereby serving as a foundation model for various downstream recommendation tasks.

  • P5 can easily be fused with other modalities and enables instruction-driven recommendation based on prompts.
  • P5 advances recommender systems from shallow models and deep models to the big-model era, and will reshape the technical form of recommender systems as a universal recommendation engine.
  • By adaptively generating personalized prompts for different users, P5 can make predictions in a zero-shot or few-shot manner, greatly reducing the reliance on extensive fine-tuning.

We conduct experiments on several recommendation benchmarks to verify the effectiveness of P5; the code and models have been open-sourced.

Over the past decades, recommender systems have made remarkable progress and play an important role in people's daily lives. Today, they are evolving into comprehensive systems with more diverse features and broader application scenarios.

1.1 Characteristics of Today's Recommender Systems

Feature representation and learning are increasingly complex

Feature engineering and learning in recommender systems have evolved from simple to complex.

  • Early on, recommender systems typically adopted logistic regression or collaborative filtering [25, 35, 50, 52], using user-item interaction data to model users' behavior patterns.
  • Later, contextual features (such as user profiles and item metadata) were further integrated into the system through more sophisticated models such as factorization machines [48] and GBDT [20].
  • More recently, deep neural network models [3, 5, 19, 74] have enabled crossing and combining more diverse and complex features. Compared with traditional feature-engineering-based approaches, these models therefore achieve better representation ability.

The types of recommendation tasks are increasingly diverse

The variety of recommendation tasks also keeps growing. Beyond classic rating prediction and recommendation based on direct user-item matching, recent research is expanding to new tasks and scenarios, such as

  1. sequential recommendation [21, 60, 63, 80]
  2. conversational recommendation [8, 61, 76]
  3. explainable recommendation [17, 31, 62, 70, 75, 77]

and so on. Although methods for the above tasks are usually proposed separately, a clear trend is to leverage multiple recommendation tasks to jointly learn transferable representations [31, 56, 57, 72].

1.2 What Modern Recommender Systems Need

Despite the great success of existing recommender systems, they still face many problems in practice; we argue that a comprehensive recommender system supporting diverse features and different task types is needed.

Recommendation tasks usually share the same user–item pool and have overlapping contextual features. Therefore, we believe it is promising to merge multiple recommendation tasks into one unified framework, so that the tasks can implicitly transfer knowledge, benefit from each other, and generalize to other unseen tasks.

1.3 What P5 Contributes

Inspired by recent progress in multitask prompt-based training [1, 51, 67], this paper proposes a unified paradigm, P5. It has three main advantages:

  1. The recommendation model (a behavior model) is deeply immersed in a language environment (a language model).

    Based on personalized prompts, all recommendation tasks are reformulated as NLP tasks. Since natural language is flexible and powerful enough to express all kinds of features in text, there is no need to design feature-specific encoders. In this way, P5 can make full use of the rich semantics and knowledge in the training corpus.

    Translator's note: see also "The Evolution of Generative Recommendation (GR) from a Tokenization Perspective (2025)".

  2. Multiple recommendation tasks are placed into the same text-to-text encoder-decoder and trained with the same language modeling loss, instead of designing task-specific architectures and objective functions.

    In other words, P5 treats all personalized tasks as conditional text generation problems.

  3. Trained with instruction-based prompts, P5 attains good zero-shot performance when generalizing to novel personalized prompts or unseen items in other domains.

Figure 1: P5 pretrains on an encoder–decoder Transformer model that takes in textual inputs and produces target responses. We trained P5 on a multitask collection of personalized prompts. After multitask prompt-based pretraining on recommendation datasets, P5 achieves the capability of zero-shot generalization to unseen personalized prompts and new items.

2.1 Attempts at Unified Frameworks

There have been prior attempts to solve various recommendation tasks within a unified model.

Based on general-purpose language models (T5 and GPT-3)

Early pioneers:

  • T5: unifies downstream NLP tasks via a text-to-text encoder-decoder framework.
  • GPT-3: unifies downstream NLP tasks via autoregressive language modeling.

Both enable effective knowledge sharing across different tasks on top of a single pretrained language model (i.e., a general-purpose model).

Natural-language-based seq-to-seq architectures

More recently, the community has focused on unifying large-scale language tasks [1, 51, 67] or cross-modal applications [6, 66, 71] through a shared sequence-to-sequence framework, where different types of tasks and modalities are all expressed in natural language.

However, these approaches do not consider personalization in the model.

Based on universal user representations

[56, 57, 72] attempt to learn universal user representations that transfer easily to downstream tasks. One limitation of these methods is that they still require finetuning on downstream datasets.

In contrast, P5 brings personalization into an encoder-decoder Transformer model that can generalize to a wide range of scenarios requiring personalized recommendation. Moreover, with prompt-based pretraining, P5 attains good zero-shot generalization when transferring to unseen prompts and items.

2.2 Prompt Learning

The success of the GPT series, especially GPT-3, marked the popularization of prompts for NLP tasks.

  • Trained on a huge amount of language data collected from the Internet, GPT-3 demonstrated the ability to solve NLP tasks when given a few input-output examples as exemplar prompts.
  • Other prompt design methods following the "pretrain, prompt, and predict" paradigm have also been developed recently [37].
    • [16, 23, 36, 40, 58] explore the search for task-specific discrete prompts.
    • [18, 28, 33, 38, 45, 81] use continuous vector embeddings as prompts.

Since instruction-based prompts contain detailed task descriptions, are closer to natural language, and resemble the way humans communicate, some works [11, 68] argue that learning from diverse NLP datasets is one path toward general-purpose NLP systems. Recent works such as FLAN [67] and T0 [51] finetune pretrained language models on large NLP datasets organized via human-readable prompts, and show strong zero-shot ability on unseen tasks.

Inspired by the success of these methods, we create a collection of personalized prompts and train a sequence-to-sequence model on a diverse set of recommendation tasks.

2.3 NLP for Recommendation

Recommendation has long intersected with NLP techniques, along four main directions:

  1. explainable recommendation [4, 10, 30–32, 75, 77], where NLP models help generate text explanations for a given recommendation;
  2. sequential recommendation as language modeling [9, 60, 80], which treats user interaction histories as word token sequences;
  3. text feature extraction [69, 74, 79], which aims to extract informative text encodings that can improve recommendation performance;
  4. conversational recommendation [8, 12–14, 22, 61, 76], which reasons about users' intent and gives recommendations in an interactive dialog format.

This paper mainly covers the first two directions, and discusses how to design a unified NLP framework that covers tasks such as rating prediction, top-k recommendation, and review summarization.

In addition, by pretraining with instruction-style prompts, P5 benefits from the natural language environment and improves performance on a series of recommendation tasks.

2.4 Zero-shot and Cold-start Recommendation

The performance of recommender systems largely depends on the available training data, but zero-shot or few-shot situations always exist. A recommendation model that still performs well in such cold-start scenarios demonstrates good generalization ability.

A common line of research is cold-start recommendation, where users [26] or items [53] are new to the system and have no prior interaction records.

  • A common solution is to learn to model content features [15, 29, 44, 55] so that inference is possible without interaction records, or to learn transferable representations from other auxiliary domains [42, 56, 59, 72, 82].
  • Another approach is quick adaptation to the new domain, rather than handling individual cold-start cases. Solutions typically follow meta learning [27, 64] or causal learning [34] frameworks to make the model robust to domain adaptation.

In our work, we ask the P5 model pretrained on auxiliary domains to solve tasks in a target domain, where the users are known to P5 but the items are unseen.

To facilitate multitask prompt-based pretraining, we create a collection of personalized prompts covering five different task families:

  1. rating prediction
  2. sequential recommendation
  3. explanation
  4. review
  5. direct recommendation

Each task family contains multiple personalized prompts that help P5 discover various aspects of user-item associations.

In [51], a prompt consists of an input template and a target template, together with a set of associated metadata. In this paper, we further define a personalized prompt as a prompt that includes personalized fields for different users and items.

For example, a user's preference can be indicated by an ID or described by a piece of text. Moreover, given a personalized prompt, the expected model output should also vary with its item field, which implies the user's different preferences toward different items. Such an item field can be represented as an item ID or as item metadata containing a detailed description.

3.1 Prompt Design

We design a basic collection of personalized prompts for each task.

Prompts for rating prediction

For the rating prediction task, we divide the prompts into three categories:

  1. given information about a user and an item, directly predict the user's rating for that item, on a scale of 1 to 5;
  2. predict whether the user will rate an item a given score, with an expected output of yes or no;
  3. predict whether the user likes or dislikes an item.

We regard a rating of 4 or higher as a "like" preference, and a lower rating as a "dislike" preference.
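To make the prompt-and-target construction concrete, here is a small sketch. The template strings are hypothetical paraphrases, not the paper's exact prompt wordings (those are listed in the appendix); the rating-to-preference threshold follows the definition above.

```python
# Hypothetical template strings; the actual P5 prompt wordings are in the
# paper's appendix.
RATING_TEMPLATES = {
    "direct": "What star rating do you think user_{user} will give item_{item}?",
    "yes_no": "Will user_{user} rate item_{item} a {score}? Answer yes or no.",
    "like":   "Does user_{user} like or dislike item_{item}?",
}

def build_rating_example(kind: str, user: int, item: int,
                         true_score: int, asked_score: int = 5):
    """Fill a personalized prompt and derive its target answer.
    A rating >= 4 is treated as 'like', as defined in the text above."""
    prompt = RATING_TEMPLATES[kind].format(user=user, item=item,
                                           score=asked_score)
    if kind == "direct":
        target = str(true_score)
    elif kind == "yes_no":
        target = "yes" if true_score == asked_score else "no"
    else:
        target = "like" if true_score >= 4 else "dislike"
    return prompt, target
```

Each (prompt, target) pair then becomes one input-target training instance, as described in Section 3.2.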

Prompts for sequential recommendation

For the sequential recommendation task, we create three types of prompts:

  1. directly predict the next item based on the user's interaction history;
  2. given the user's interaction history, choose the likely next item from a candidate list in which only one item is the positive sample;
  3. based on the user's interaction history, predict whether a given item will be the user's next interaction.

Prompts for explanation generation

For the explanation task, we ask P5 to generate a textual explanation that justifies a user's preference toward a given item. Two kinds of prompts:

  1. directly generate an explanation sentence given the user and item information;
  2. generate an explanation based on a feature word given as a hint.

For each category, auxiliary information such as the review title and the rating score may also be included.

Prompts for review-related tasks

For review-related tasks, we create two types of prompts:

  1. summarize a review into a shorter review title;
  2. predict the corresponding rating based on a given review.

Prompts for direct recommendation

For the direct recommendation task, we create two types of prompts:

  1. predict whether to recommend an item to a user, with an expected output of yes or no;
  2. choose the most suitable item to recommend to the user from a candidate item list.

The full collection of personalized prompts is given in the Appendix.

3.2 Building the Training Data (prompts & answers) from Raw Data

The process of constructing the training data is shown in Figure 2.

Figure 2: Building input-target pairs for training, or zero-shot personalized prompts for testing, from raw data according to the designed personalized prompt templates. The raw data come from three sources. Specifically, rating/review/explanation (a) share the same raw data, while sequential recommendation (b) and direct recommendation (c) use similar raw data, except that the former additionally needs user interaction histories. The full collection of P5 personalized prompts is given in the Appendix.

The training data and pretraining tasks distill users' preferences and personalized information from these data. During pretraining, we mix input-target pairs from different tasks together as the training data.

To enhance P5's robustness and zero-shot generalization, for each raw datum we only sample a subset of the personalized prompts in each task rather than all of them. For the sequential and direct recommendation tasks, we also randomly select negative items for the scenarios that require candidate lists.
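The negative sampling step above can be sketched as follows. Function and parameter names are illustrative; the default of 99 negatives matches the 1-out-of-100 candidate pool described later in the implementation details.

```python
import random

def build_candidate_list(positive_item: str, all_items: set,
                         interacted: set, num_negatives: int = 99,
                         seed: int = 0) -> list:
    """Assemble a shuffled candidate list of one positive item plus
    sampled negative items the user has not interacted with, as needed
    by prompts that require a candidate list."""
    pool = list(all_items - interacted - {positive_item})
    rng = random.Random(seed)
    negatives = rng.sample(pool, num_negatives)
    candidates = negatives + [positive_item]
    rng.shuffle(candidates)
    return candidates
```

The resulting list is then filled into the item fields of a candidate-list prompt template.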

All pretraining data share a unified input-target token sequence format, which breaks the boundaries between different tasks. Pretraining multiple recommendation tasks under a unified conditional generation framework can improve the performance of all tasks.

The whole pretraining stage immerses P5 in a full language environment. We expect this to strengthen its zero-shot generalization ability so that it can understand novel personalized prompts, even when those prompts contain detailed item descriptions. This is why P5 is called a unified "Pretrain, Personalized Prompt, and Predict Paradigm".

4.1 The P5 Architecture

For the P5 architecture, we adopt a basic encoder-decoder framework and build both the encoder and the decoder with Transformers.

Suppose the embeddings of the input token sequence are $\mathbf{x} = \left[x_1, \cdots, x_n\right]$. As shown in Figure 3,

Figure 3: Illustration of the P5 architecture. For the example prompt input What star rating do you think user_23 will give item_7391?, P5 first encodes the input with a bidirectional text encoder, then generates the answer autoregressively with a text decoder. Unlike task-specific recommendation models, P5 is based on multitask prompt-based pretraining, so it can adapt to different tasks and generalizes well.

Positional encoding

Positional encodings are added to capture positional information in the sequence.

Whole-word embedding: compensating for the semantic loss caused by the tokenizer splitting item tokens

To let P5 capture the personalized information contained in the input sequence, we also apply whole-word embeddings $\mathcal{W}$ to indicate whether consecutive sub-word tokens come from the same original word.

Why is this step necessary? As an example:

  • If we represent an item directly by its ID 7391, i.e., item_7391, the SentencePiece tokenizer will split this word into 4 separate tokens (item, _, 73, 91) instead of the single token we would expect. With a shared whole-word embedding (<w10> in Figure 3), P5 can better recognize the fields carrying personalized information.
  • An alternative is to represent each user/item with an independent extra token (e.g., <item_7391>). However, when the numbers of users and items are large, this may introduce a huge number of extra tokens.
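A toy illustration of how whole-word ids can be derived from sub-word pieces, assuming SentencePiece's convention that a '▁'-prefixed piece starts a new word. The real P5 implementation uses the T5 tokenizer; the pieces shown are illustrative, not actual tokenizer output.

```python
def whole_word_ids(sub_tokens: list) -> list:
    """Assign the same whole-word id to consecutive sub-word tokens that
    belong to one original word. A token starting with the SentencePiece
    whitespace marker '▁' begins a new word; later pieces continue it."""
    ids, current = [], 0
    for i, tok in enumerate(sub_tokens):
        if tok.startswith("▁") and i > 0:
            current += 1
        ids.append(current)
    return ids

# e.g., 'item_7391' split into pieces like ['▁item', '_', '73', '91']
# all map to one shared whole-word id.
```

These per-token ids index the whole-word embedding table $\mathcal{W}$, and the resulting vectors are summed with the token and positional embeddings.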

encoder & decoder

Next, the text encoder takes the sum of the three embeddings above, $\mathbf{e} = \left[e_1, \cdots, e_n\right]$, as input, and outputs the contextualized representations $\mathbf{t} = \left[t_1, \cdots, t_n\right] = \mathcal{E}(\mathbf{e})$.

The decoder $\mathcal{D}(\cdot)$ then attends to both the previously generated tokens $\mathbf{y}$ and the encoder output $\mathbf{t}$, and predicts the probability distribution of future tokens:

\(P_{\theta}\left(\mathbf{y}_{j} \mid \mathbf{y}_{<j}, \mathbf{x}\right) = \mathcal{D}(\mathbf{y}_{<j}, \mathbf{t})\).

During pretraining, P5 is trained end-to-end by minimizing the negative log-likelihood of the label tokens $\mathbf{y}$ conditioned on the input text $\mathbf{x}$:
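The loss equation itself is not reproduced on this page; consistent with the decoder distribution defined above, it has the standard sequence-to-sequence form:

```latex
% Pretraining loss: negative log-likelihood of the label tokens y
% given the input text x, summed over target positions.
\mathcal{L}_{\theta} = -\sum_{j=1}^{|\mathbf{y}|} \log P_{\theta}\left(\mathbf{y}_{j} \mid \mathbf{y}_{<j}, \mathbf{x}\right)
```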

This same loss function is shared by all recommendation tasks under P5. We thus unify recommendation tasks with one model, one loss, and one data format.

4.2 Recommendation (Inference) with the Pretrained P5

After pretraining, P5 can directly perform different tasks from personalized prompts, whether or not it has seen those prompts before.

  • For the rating, explanation, and review tasks, we simply use greedy decoding to generate the answers.
  • For the sequential and direct recommendation tasks, which usually require a list of items as the target output, we use beam search.

For sequential recommendation, we apply beam search to generate a list of potential next items. For direct recommendation, we predict the recommended item from a candidate set $\mathbf{S} = \{S_1, \cdots, S_m\}$, where only one of the $m$ candidates is the positive sample. Here we likewise use beam search to decode a list of potential target items with the highest scores, which is then evaluated. Both decoding processes can be written as:
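The equation (Eq. (2) in the paper) is not rendered on this page; a reconstruction of its general form, consistent with the surrounding notation, is:

```latex
% Beam search keeps the B highest-probability output sequences under the
% model distribution and returns them as the item list C.
\mathbf{C} = \operatorname{BeamSearch}_{B}\, P_{\theta}\left(\mathbf{y} \mid \mathbf{x}\right)
```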

where $B$ denotes the beam size and $\mathbf{C}$ denotes the output item list.

In this section, we evaluate P5's performance on real-world data and compare it with other representative methods. Through performance comparisons and ablation studies, we aim to answer the following questions:

5.0 Research Questions (RQ1-RQ5)

Question 1: How does P5 compare with task-specific methods?

How does our unified P5 framework perform compared with task-specific methods on all five task families?

Question 2: How well does P5 generalize zero-shot?

Does P5 have enough zero-shot generalization ability when transferring to unseen personalized prompts for either existing or new items?

Question 3: How do model size, the number of task families, and the number of prompts affect P5's performance?

How do scaling factors such as model size, number of task families, and number of prompts affect the performance of P5?

Question 4: What is the best way to implement personalization in P5? (unique tokens vs. sub-word units)

Which is a better way to implement personalization in P5: adopting an independent extra token for each user or item (e.g., “⟨user_23⟩”) or the default setting, i.e., tokenizing each user or item into multiple sub-word units (e.g., “user”, “_”, “23”)?

Question 5: How long does P5 pretraining take, and how efficient is inference?

How long does it take for P5 to conduct pretraining? Is it efficient to make inference with the pretrained P5 model? We provide statistics on training and inference time in the Appendix.

5.1 Experimental Setup

Datasets

We conduct extensive experiments over four real-world datasets. The Amazon datasets are collected from the Amazon.com platform, with user ratings and reviews on 29 categories of products. In this paper, we adopt three of them to evaluate our method, namely Sports & Outdoors, Beauty, and Toys & Games. Besides, the Yelp dataset contains a large number of user ratings and reviews for business recommendation. We follow [80] and use transaction records between January 1, 2019 and December 31, 2019. Due to the space limit, and because the results on Yelp show trends similar to the other datasets, we put the experimental results on the Yelp dataset in the Appendix. The detailed statistics of these datasets are presented in Table 1.

Task splits

For rating, explanation, and review task families, we randomly split each dataset into training (80%), validation (10%) and testing (10%) sets, and ensure that there is at least one instance included in the training set for each user and item. To obtain the ground-truth explanations, following the natural language explanation works [30, 31], we first extract item feature words from the reviews with the help of the Sentires toolkit [77, 78], and then extract the sentences from reviews that comment on one or more item feature words as users' explanation about their preference. In terms of the sequential recommendation task family, for each user interaction sequence, the last item is used as the test data, the item before the last one is used as the validation data, and the remaining data is used for training. To avoid data leakage during pretraining, we follow the training split of sequential recommendation to build the training set for the direct recommendation task family.

Implementation Details

Our P5 model utilizes the pretrained T5 checkpoints [47] as backbone. According to the size of the T5 backbone, we create two versions of P5, namely P5-small (P5-S) and P5-base (P5-B). P5-small has 6 layers for both encoder and decoder, a model dimensionality of 512 with 8-headed attention, and 60.75 million parameters. For P5-base, encoder and decoder both have 12 Transformer blocks; the model has an embedding dimensionality of 768 and 12-headed attention, with 223.28 million parameters.

For tokenization, we use the SentencePiece [54] tokenizer with a vocabulary size of 32,128 for parsing sub-word units. We pretrain P5 for 10 epochs with AdamW optimization [39] on four NVIDIA RTX A5000 GPUs. The batch size is set to 16 for P5-base and 32 for P5-small. We choose 1 × 10^-3 as the peak learning rate and set the maximum length of input tokens to 512. A warmup strategy adjusts the learning rate during training; the warmup stage is set to the first 5% of all iterations. When negative sampling is needed for training, we use 1:1 positive vs. negative sampling for both P5 and the baselines.

Our default pretrain–predict combination adopts the last prompt in each task family for zero-shot evaluation, while all remaining prompts are utilized for multitask prompted pretraining. For rating prediction, we use Gaussian sampling to convert the original integer scores to float numbers rounded to 1 decimal place. In this way, we can avoid overfitting the limited score types. After this change, the number of score classes increases from 5 to 41. For sequential recommendation, we set the beam size B to 20. For direct recommendation, the beam size is also 20 and the candidate pool contains 100 items, which consist of one ground-truth item and 99 sampled negative ones that the user has not interacted with.
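The Gaussian score-smoothing step can be sketched as below. The noise scale sigma is an assumption, since the paper does not state it; the 41 score classes are the values 1.0, 1.1, ..., 5.0.

```python
import random

def smooth_rating(score: int, sigma: float = 0.1) -> float:
    """Jitter an integer rating with Gaussian noise, round to 1 decimal
    place, and clamp to the valid [1.0, 5.0] range.

    sigma is an assumed noise scale (not stated in the paper)."""
    noisy = random.gauss(score, sigma)
    return min(5.0, max(1.0, round(noisy, 1)))

# The reachable targets are the 41 values 1.0, 1.1, ..., 5.0.
targets = [round(1.0 + 0.1 * i, 1) for i in range(41)]
```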

Metrics

  • For rating prediction and review preference prediction, we adopt Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
  • For sequential recommendation and direct recommendation, we adopt top-k Hit Ratio (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K), and report HR@1, 5, 10 and NDCG@5, 10.
  • For explanation generation and review summarization, we adopt BLEU-4, ROUGE-1, ROUGE-2, and ROUGE-L.

Lower is better for RMSE and MAE, while higher is better for the other metrics. In all tables, bold numbers denote the best performance and underlined numbers the second best.
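With a single ground-truth item per test case (the setting used here), these ranking metrics reduce to simple formulas; a minimal sketch:

```python
import math

def hit_ratio_at_k(ranked_items: list, ground_truth, k: int) -> float:
    """HR@K: 1 if the ground-truth item appears in the top-k list."""
    return 1.0 if ground_truth in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items: list, ground_truth, k: int) -> float:
    """NDCG@K for a single relevant item: 1/log2(rank+1) if it is hit,
    else 0. The ideal DCG is 1 (relevant item at rank 1), so the log
    discount is the only normalization needed."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == ground_truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

In practice both metrics are averaged over all test users.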

Rating Prediction and Direct Recommendation

These tasks take the user–item rating/interaction data, but no content or side information is provided. We aim to justify whether the models are able to provide accurate rating prediction or recommendation lists that align with the user preferences. We use MF [25] and MLP [5] under mean square root loss as rating prediction baselines. For direct recommendation, we use BPR-MF [49], BPR-MLP [5], and a state-of-the-art contrastive learning-based collaborative filtering model SimpleX [43] as baselines.

Sequential Recommendation

We adopt several representative sequential recommendation approaches as our baselines. Caser [63] treats sequential recommendation as a Markov Chain and employs convolutional neural networks to model user interests. HGN [41] adopts hierarchical gating networks to learn user behaviors from both long- and short-term perspectives. GRU4Rec [21] is originally proposed for session-based recommendation. It utilizes GRU [7] to model the user click history sequence. BERT4Rec [60] mimics BERT-style masked language modeling and learns a bidirectional representation for sequential recommendation. FDSA [73] focuses on feature transition patterns by modeling the feature sequence with a self-attention module. SASRec [24] adopts a self-attention mechanism in a sequential recommendation model, which reconciles the properties of Markov Chains and RNN-based approaches. S3-Rec [80] leverages self-supervised objectives to help sequential recommendation models better discover the correlations among different items and their attributes. We use the implementation of S3-Rec and its baselines for comparison.

Explanation Generation

For performance comparison, we consider several baselines with regard to the task of explanation generation. Attn2Seq [10] learns to encode attributes into vectors, and then invokes an attention mechanism to generate reviews conditioned on the attribute vector. NRT [32] utilizes GRU [7] to generate explanations based on user and item IDs. PETER [31] is a simple and effective framework that attempts to utilize user and item IDs to generate explanations. It is built upon a modified attention mask of the Transformer architecture. There is also a variant PETER+, which takes a hint feature word to assist the explanation generation.

For review summarization, we adopt pretrained T0 [51] and GPT-2 [46] checkpoints hosted by Hugging Face as baselines. For review preference prediction, we only use T0 for comparison because GPT-2 cannot perform this task.

5.3 Performance Comparison on Different Task Families (RQ1)

In this section, we pretrain P5 with prompts from all five task families to verify its multitask learning ability. According to the default pretrain–predict task combination, we leave Prompt 1-10, Prompt 2-13, Prompt 3-12, Prompt 4-4, and Prompt 5-8 for zero-shot evaluation and pretrain P5 with the remaining personalized prompts. The performances of P5 and relevant baselines on the five task families are presented in Table 2 to Table 7. For each task family, we choose one or more seen prompts as a supplement to the aforementioned zero-shot unseen prompts to perform evaluations.

5.3.1 Rating Prediction

Prompt 1-6 and Prompt 1-10 are used for evaluating P5’s performance on rating prediction. The performance comparison is presented in Table 2. We can see that when testing with seen Prompt 1-6, P5-B gets better MAE and slightly higher RMSE on all three datasets compared with MF. When testing with unseen Prompt 1-10, P5-B can achieve similar performance as Prompt 1-6. Moreover, P5-S usually has better MAE but higher RMSE. It seems that P5 is overfitting these data since the task complexity of rating prediction is relatively lower than other recommendation tasks. Overall, these results show that it is feasible to perform rating prediction on a conditional text generation framework.

5.3.2 Sequential Recommendation

As illustrated in Table 3, Prompt 2-3 and Prompt 2-13 are employed for the evaluation of sequential recommendation under the all-item setting, i.e., using all items as candidates rather than sampling 100 or 1,000 items for ranking. From the table, we can see that P5-B surpasses all competitive baselines by a relatively large margin on both seen (Prompt 2-3) and unseen (Prompt 2-13) prompts. On Toys, P5-S achieves even better performance than P5-B, while on Beauty and Sports, P5-B holds the advantage over P5-S. The results show that the P5 architecture is effective in modeling the user interaction history and predicting the next item with the help of beam search.

5.3.3 Explanation Generation

In Table 4, Prompt 3-9 and Prompt 3-12 are used to evaluate P5's performance on explanation generation under the feature-based setup, while Prompt 3-3 is used for direct explanation generation without a hint word. We can see that for Prompt 3-3, P5 achieves the best performance against all baselines. For feature-based prompts (Prompts 3-9 & 3-12), P5 outperforms PETER+ in most cases, especially on Beauty and Toys.

We take Prompts 4-2 and 4-4 to compare P5's performance with T0 on review preference prediction, as shown in Table 5. We can see that P5-S achieves better RMSE and MAE on Beauty and Toys, while P5-B shows better performance on Sports. Additionally, we take Prompt 4-1 to evaluate P5's ability on review summarization, as shown in Table 6. For this task, P5-S clearly outperforms T0 and GPT-2 on both the Beauty and Toys datasets. It is worth noting that GPT-2 and T0 have 1.5B and 11B parameters, respectively. This shows that P5 can achieve better performance than these competitive baselines with a much smaller model size.

5.3.5 Direct Recommendation

Finally, Prompts 5-1, 5-4, 5-5 and 5-8 are applied to evaluate the direct recommendation task under the 1-out-of-100 evaluation setting. For binary question prompts (5-1 & 5-4), which are discriminative prompts, we use the softmax generation probability of “yes” to rank the candidate items. For open question prompts (5-5 & 5-8), which are generative prompts, we use beam-search (Eq.(2)) to generate the top-𝑘 list. The results are presented in Table 7. From the table, we can see that P5-B and P5-S have great advantages over BPR-MF and BPR-MLP on all three datasets. Comparing with SimpleX, we can see that P5 works especially well on top-1 item ranking, which is more than two times better than SimpleX on HR@1. Besides, P5 also achieves the best result on most of the other metrics. The success of P5 on direct recommendation shows the competence of the sequence-to-sequence generation framework in recommendation domain.

5.4 Zero-shot Generalization to Unseen Prompts and Items in New Domain (RQ2)

5.4.1 Transfer to Unseen Personalized Prompts

In this section, we transfer the pretrained P5 models to the prompts held out during pretraining. These unseen prompts are from the same task families, and the testing items have been seen by P5 during pretraining at least once. The experimental results are also reported in Table 2 to Table 7. As previously discussed in Section 5.3, P5 achieves surprisingly good performance on various task families when challenged by unseen prompts. On some specific datasets, P5's performance on unseen prompts even surpasses that on seen prompts, e.g., P5-B gets the best performance under Prompt 2-13 on Sports. These results show that multitask prompted pretraining gives P5 enough robustness to understand unseen prompts with wording variations.

5.4.2 Transfer to Items in New Domain

Next, we increase the difficulty level of zero-shot transfer. We collect a group of 741 users that exist in all three domains, together with their interaction and review histories in the other domains. The detailed statistics of these domain-transfer evaluation sets are illustrated in Table 8. We then challenge P5-B pretrained on one domain with unseen prompts from Task Family Z, whose item fields are filled with information from a new product domain. For example, we ask the P5 model pretrained on the Toys domain about an existing user's preference toward an item in the Beauty domain. The full results on all six directions are reported in Table 9. From the table, we notice P5 still maintains sufficient performance on rating prediction (Prompts Z-2 & Z-3), like/dislike prediction (Prompts Z-1 & Z-4), as well as explanation generation with a feature word (Prompt Z-6). In contrast, direct explanation generation without a feature word (Prompts Z-5 & Z-7) is very difficult for P5 because it lacks awareness of relevant knowledge in the new domain. In Figure 4, we provide some example explanations generated by P5-B under the zero-shot domain-transfer setup (Prompt Z-6). We can see that P5 is able to catch different users' rating preferences and hint feature words, then integrate them with knowledge learned from the previous domain to generate plausible explanations.

5.5 Ablation on Model Size (RQ3)

In this section, we discuss the influence of model size on P5's performance across recommendation tasks. We train two size variants of P5, namely P5-small and P5-base, with 60.75M and 223.28M parameters, respectively. From Table 2 to Table 7, we can see that although P5-S is only 1/4 the size of P5-B, P5-S can beat P5-B on a series of tasks and datasets. For example, P5-S achieves better sequential recommendation, review preference prediction, and direct recommendation (Prompts 5-5 & 5-8) performance than P5-B on Toys. In contrast, P5-B shows advantages on the sequential recommendation and review preference prediction tasks for Sports. Since Sports contains more users, items, and reviews and has lower sparsity, it requires a model with higher capacity to discover latent correlations among different personalized factors. The findings indicate that larger P5 models may be needed when the dataset is large, while for smaller datasets, smaller P5 models could be enough. As a result, we should choose a model size that matches the scale of the training data.

5.6 Ablation on Task Scaling (RQ3)

Moreover, we explore whether multitask prompted pretraining is superior to pretraining on each task family alone. We pretrain P5-small on the Beauty dataset with prompts from every single task family, resulting in five models – P5-S1, P5-S2, P5-S3, P5-S4, and P5-S5. We then compare P5-S on various recommendation tasks with the corresponding single-task P5 model. The performance comparison between P5-S and P5-SN (𝑁 ∈ [1, 2, 3, 4, 5]) is illustrated in Figure 5. As shown in the figure, P5-S achieves comparable or better performance than P5-SN on rating prediction, sequential recommendation, and direct recommendation, while on text generation tasks such as explanation generation (Prompts 3-9 & 3-12) and review summarization (Prompt 4-1), P5-SN is better than P5-S. This indicates that multitask modeling (P5-S) strikes a good balance among tasks and improves recommendation performance by leveraging the power of language understanding. Besides, both P5-S and P5-SN perform better than or comparably with state-of-the-art baselines on all tasks, as shown in Table 2 through Table 7, which demonstrates the power of P5 for recommendation.

5.7 Ablation on Prompt Scaling (RQ3)

As mentioned in the implementation details, our default pretrain–predict task combination follows the leave-one-out strategy. However, do we need so many prompts during pretraining to enable P5's zero-shot generalization ability? In this section, we explore reducing the number of pretraining prompts and compare with the P5 model pretrained under the default setup. To this end, we choose a collection of pretraining prompts with the minimum number of prompts needed to cover all important personalized fields. Specifically, this combination contains the following 18 personalized prompts: {1-5, 1-6, 1-8, 1-9, 2-1, 2-3, 2-8, 2-11, 3-2, 3-3, 3-6, 3-9, 4-1, 4-2, 4-3, 5-2, 5-5, 5-7}. As in the default pretrain–predict combination, the last prompt in each task family is reserved for zero-shot evaluation. We name this prompt-scaling variant of P5-small P5-PS and pretrain it on the Beauty dataset. The performance comparison between P5-S and P5-PS is also presented in Figure 5. From the figure, we observe that P5-S beats P5-PS on most tasks except for some generation tasks (i.e., Prompts 3-3, 3-9 & 4-1). Interestingly, P5-S outperforms P5-PS on Prompt 3-12 – a zero-shot explanation generation task. In fact, P5-S also shows its superiority on other zero-shot tasks such as Prompts 1-10, 2-13, and 5-8. Overall, we find that a larger number of high-quality personalized prompts generally helps P5 achieve better performance on various recommendation tasks, especially zero-shot tasks with unseen prompts.

5.8 How to Implement Personalization (unique tokens vs. sub-word units) (RQ4)

This section discusses different ways to implement personalization and compares their performance in P5.

  • Option 1 (default, the P5-S model): use the SentencePiece tokenizer to split personalized fields into multiple sub-word units, together with whole-word embeddings to preserve the field information (see Figure 3).
  • Option 2: give each user and each item an independent token; we call this variant P5-I.

The former implicitly optimizes the correlations among different sub-word tokens through collaborative learning, while the latter learns each unique user or item through a newly introduced token. The performance comparison is shown below.

Figure 6: Performance of P5-S and P5-I on Beauty showing the influence of how to implement personalization.

We can see that:

  • P5-I performs similarly to P5-S on the regression tasks (Prompts 1-6 & 1-10 for rating prediction, Prompts 4-2 & 4-4 for review-based rating regression) and the summarization task (Prompt 4-1).
  • P5-I is slightly better than P5-S on explanation generation (Prompts 3-3, 3-9 & 3-12).
  • P5-I is significantly worse than P5-S on the sequential and direct recommendation tasks (all prompts in Figure 6 (c) & (d)), by a large margin.

    The reason behind P5-I's lower performance is that the large number of newly introduced extra tokens and embeddings cannot be trained as well as the original sub-word units initialized from T5.

This indicates that our sub-word approach achieves better recommendation and overall performance through collaborative learning, while keeping the number of learnable tokens small.

Further Explanation from Kimi (translator's note)

According to Section 5.8 of the paper ("How to Implement Personalization"), P5-I (an independent token per user/item) and P5-S (the default sub-word decomposition) differ significantly across task types. The reasons can be broken down into the following four technical aspects.

I. Fundamental differences in parameter efficiency and collaborative learning

| Implementation | Parameter size | Initialization | Collaborative learning |
|---|---|---|---|
| P5-S (default) | Constant (reuses T5's 32,128-entry vocabulary) | Inherits T5's pretrained sub-word embeddings | Strong: different IDs share sub-word units (e.g., "user", "_", "12") and implicitly learn user/item correlations through shared composition patterns |
| P5-I (independent tokens) | Grows linearly (one new embedding per user/item) | Random initialization | Weak: each ID embedding is fully independent, learns only from task supervision, and cannot share knowledge across IDs |

Core problem: adding one token for each of the 35,598 users and 18,357 items in the Amazon Sports dataset introduces roughly 54 thousand brand-new embedding vectors. These parameters are trained from scratch and appear with uneven frequency in the pretraining data, so that:

  • high-frequency IDs overfit to specific training samples;
  • low-frequency IDs are under-trained and have poor representations;
  • the model loses T5's original language understanding and generalization ability.
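A back-of-the-envelope calculation with the numbers quoted above (768 is P5-base's embedding dimensionality) makes the parameter overhead concrete:

```python
# Rough cost of the P5-I alternative on the Sports dataset: one new
# embedding row per user/item, at the P5-base embedding size of 768.
num_users, num_items, dim = 35_598, 18_357, 768

extra_tokens = num_users + num_items  # new vocabulary entries
extra_params = extra_tokens * dim     # new embedding parameters

print(extra_tokens)           # 53955
print(extra_params / 1e6)     # 41.43744 (about 41.4M extra parameters)
```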

II. Task-by-task analysis

1. Where P5-I performs "similarly or slightly better": regression tasks & text generation

Tasks: rating prediction (Prompt 1-6/1-10), review preference prediction (Prompt 4-2/4-4), explanation generation (Prompt 3-3/3-9/3-12)

Reasons:

  • Direct supervision signal: the inputs of these tasks contain rich semantic information (e.g., review text, item titles), so the model relies mainly on T5's encoding-decoding ability and has little need for collaborative signals from the IDs themselves.
  • Memorization advantage: P5-I's independent embeddings can effectively "memorize" a specific user's rating/writing-style patterns and thus reach lower loss on the training set.
  • Supporting numbers from the paper: on the Beauty dataset, P5-I's BLEU-4 on explanation generation is slightly higher (+0.02), while on Sports there is no significant difference, suggesting the memorization effect is more pronounced on small datasets.

2. Where P5-I performs "significantly worse": pure recommendation tasks

Tasks:

  • sequential recommendation (Prompt 2-3/2-13): requires modeling pattern transitions in user behavior sequences (e.g., "bought a basketball → may buy sneakers");
  • direct recommendation (Prompt 5-5/5-8): requires picking the best-matching top-k from candidate items.

Performance gaps (paper Table 7 & Figure 6):

  • on Sports, P5-I's HR@1 drops 61% relative to P5-S (0.0701 → 0.0274);
  • on Beauty, NDCG@5 drops 47% (0.1673 → 0.0882).

Root causes:

  1. Loss of collaborative signal: sub-word decomposition lets similar IDs share patterns (e.g., "item_12345" and "item_12346" share prefixes), while P5-I isolates every ID and cannot capture the latent structure of user-item interactions.
  2. Cold-start disaster: in the zero-shot setting (Prompt 5-8), P5-I's independent embeddings for unseen items have never been trained, so prediction fails completely, whereas P5-S can generalize to new item IDs through sub-word composition.
  3. Optimization difficulty: P5-I's ID embeddings are huge in number, their gradient updates are unstable during multitask pretraining, and the model easily falls into poor local optima.

III. The impact of dataset scale

Figure 6 of the paper shows that the larger and sparser the dataset, the more pronounced P5-I's disadvantage:

| Dataset | #Users / #Items | Sparsity | P5-I vs. P5-S gap on direct recommendation |
|---|---|---|---|
| Sports | 35,598 / 18,357 | 0.0453% | largest (HR@1 gap -61%) |
| Beauty | 22,363 / 12,101 | 0.0734% | medium (HR@1 gap -30%) |
| Toys | 19,412 / 11,924 | 0.0724% | smallest (but still -15%) |

Explanation: the sparser the data, the more a strong collaborative-filtering capability is needed to mine patterns from limited interactions. P5-S's sub-word sharing mechanism transfers knowledge effectively in sparse settings, while P5-I's independent embeddings cannot be trained adequately with so few samples.

IV. Summary

The paper states:

"The reason behind P5-I's lower performance lies in that the newly introduced huge number of extra tokens and embeddings cannot be well trained compared with the original sub-word units initialized from T5."

P5-S compensates for the information loss of sub-word splitting via whole-word embeddings, preserving collaborative learning while avoiding a flood of new parameters, making it the better engineering choice for implementing personalization.

In short: P5-I holds up only on text generation tasks that depend little on collaborative signals; on the core recommendation tasks, its explosive parameter growth and missing collaborative ability cause a catastrophic performance drop. This validates the paper's core design philosophy: transfer knowledge through a unified language modeling framework, rather than learning an isolated representation for each entity.

In this paper, we present P5, which unifies different recommendation tasks into a shared language modeling and natural language generation framework. By designing a collection of personalized prompts covering five recommendation task families, we transfer all raw data such as the user-item interactions, user descriptions, item metadata, and user reviews into the same format – input-target text pairs. We then pretrain P5 in a full language environment to help it discover deeper semantics for various recommendation tasks. According to our experiments, P5 can beat or achieve performance comparable to several representative approaches on all five task families. Moreover, P5 shows the ability to perform zero-shot transfer to new items, new domains, and new personalized prompts. In the future, we will continue exploring further enlarging the model size of P5 and employing more powerful base models such as GPT-3, OPT, and BLOOM. Besides, P5 is a very flexible paradigm, and it is promising to further extend P5 to diverse modalities and more tasks such as conversational recommendation, comparative recommendation, cross-platform recommendation, or even various search tasks by incorporating user queries into P5. Finally, in this work we designed explicit prompts, since they are intuitive, flexible, and close to the natural way humans communicate with each other, which enables instruction-based recommendation; in the future, we will also investigate prompt search and/or latent prompt techniques to achieve instruction prompts, or leverage retrieval-enhanced generation to further boost P5's performance on downstream tasks.


Written by Human, Not by AI


文章来源: https://arthurchiao.github.io/blog/p5-paper-zh/