Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations (2024)

Nuo Chen, Hongguang Li, Juhua Huang, Baoyuan Wang, Jia Li

Hong Kong University of Science and Technology (Guangzhou)
Hong Kong University of Science and Technology
Xiaobing.AI
nchen022@connect.ust.hk, jialee@ust.hk

Abstract

Existing retrieval-based methods have made significant strides in maintaining long-term conversations. However, these approaches face challenges in memory database management and accurate memory retrieval, hindering their efficacy in dynamic, real-world interactions. This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases. Instead, COMEDY adopts a "One-for-All" approach, utilizing a single language model to manage memory generation, compression, and response generation. Central to this framework is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format. To support COMEDY, we curated a large-scale Chinese instruction-tuning dataset, Dolphin, derived from real user-chatbot interactions. Comparative evaluations demonstrate COMEDY's superiority over traditional retrieval-based methods in producing more nuanced and human-like conversational experiences. Our codes are available at https://github.com/nuochenpku/COMEDY.



1 Introduction

Maintaining long-term conversations has always been a central pursuit for open-domain dialogue systems Liu et al. (2016); Zhang et al. (2018); Kann et al. (2022); Song et al. (2023), commonly known as chatbots or conversational agents. Long-term conversation refers to the ability of a conversational agent to engage in extended dialogues over multiple interactions, often spanning several days, weeks, or even months. This setting is challenging because it necessitates not only a deep understanding of the immediate dialogue context but also the retention and integration of key information from past interactions. Effective long-term conversation requires a system to memorize or recall past dialogue snippets, contextual nuances, and user preferences, which are crucial for maintaining coherence and relevance in ongoing interactions Wu et al. (2022); Zhang et al. (2022).

[Figure 1]

To acquire useful information from past conversations, the most mainstream approach in the field of long-term conversation is currently retrieval-based, as illustrated in Figure 1 (a). Firstly, previous works Xu et al. (2022b); Bae et al. (2022) usually employ a memory generator to summarize relevant memories from past sessions, such as user portraits; in this step, the memory generator can be either a separately trained model or a powerful large language model (LLM) like GPT4 OpenAI (2023). Subsequently, a dedicated memory database, or memory bank, is used to store these memories; some studies Zhong et al. (2023b) even store past conversational utterances directly in storage. Going a step further, some works Bae et al. (2022); Wang et al. (2023) propose specific memory management operations to update and iterate the memory database. The final and indispensable step employs a sentence-embedding model Guu et al. (2020); Lewis et al. (2020) to retrieve the memories most relevant to the current conversation from the memory database. The current conversation and related memories are then fed into a specialized response generator to produce the final response.

Despite the notable success achieved by the retrieval-based methods, they encounter several limitations that impact their overall efficacy and applicability: 1) One significant challenge is the unpredictability of performance. The system’s effectiveness is contingent upon several modules (like memory generator and retriever) working in tandem; moreover, the retriever component does not guarantee the retrieval of relevant and effective memories. Sentence-embedding models, commonly used for this purpose, may not always capture the nuances and context of the conversation accurately. 2) Another clear challenge lies in the management of the memory database. As conversations accumulate, the size and complexity of the memory database grow, making it increasingly difficult to manage. Ensuring that the stored information remains relevant and up-to-date is a constant concern, as outdated or irrelevant data can lead to inaccurate or inappropriate responses.

Moreover, training corpora for long-term conversation chatbots are commonly built either by constructing personalized dialogue data with LLMs like ChatGPT or by hiring crowd-workers to simulate conversations Xu et al. (2022b). Unlike these structured or predictable dialogues, real-world conversations can veer into a wide range of topics, include colloquial language, and incorporate nuanced expressions Chen et al. (2023a). Meanwhile, the memory database in real scenarios needs to store memories from many chatbot users, increasing the difficulty of accurately retrieving relevant memories and maintaining an up-to-date memory database. These issues present a more pronounced challenge for deploying retrieval-based methods in real-world conversations.

To address these concerns, we propose an LLM-based COmpressive Memory-Enhanced Dialogue sYstems framework (COMEDY). COMEDY marks a significant departure from existing methodologies, as it operates without a retrieval module. At its core, COMEDY adopts a groundbreaking "One-for-All" approach, utilizing a single, unified model to manage the entire process from memory generation and compression to final response generation, as shown in Figure 1 (b). It first distills session-specific memory from past dialogues, encompassing fine-grained session summaries (including event recaps) and detailed user and bot portraits. In a break from traditional systems, COMEDY eschews the use of a memory database for storing these insights. Instead, it reprocesses and condenses memories from all past interactions, forming a compressive memory. The first part is a concise record of the events that have occurred throughout all conversations, creating a historical narrative the system can draw upon. The second and third parts consist of a detailed user profile and the dynamic relationship changes between the user and chatbot across sessions, both derived from past conversational events. This holistic memory allows COMEDY to generate responses that are not only contextually aware but also personalized and adaptive to the evolving nature of the user-chatbot relationship. Finally, COMEDY integrates this compressive memory into ongoing conversations, enabling contextually memory-enhanced interactions. Unlike retrieval-based systems that may struggle to fetch pertinent memories from a vast database, COMEDY's compressive memory is inherently designed to prioritize salient information, allowing for quicker and more accurate memory utilization.

To ensure that COMEDY is well-suited for real-world long-term conversations, and to overcome the lack of relevant labeled data, we have methodically assembled a large-scale instruction-tuning dataset from actual online user-chatbot interactions, named Dolphin. This dataset covers three tasks (Session-Level Memory Summarization; Memory Compression; Memory-Grounded Response Generation), comprising an extensive collection of 100k samples. Dolphin is well-annotated to support each critical phase of COMEDY's operation, from memory extraction and compression to integration and response generation. This dataset lays a robust foundation for enhancing COMEDY's dialogue capabilities, ultimately leading to a more nuanced and human-like conversational experience compared to retrieval-based baselines. COMEDY has been deployed on the X Eva platform and has received over 10 million calls, indicating its widespread use and acceptance.

Our contributions are summarized as follows:

  • We introduce COMEDY, which represents a groundbreaking shift from traditional memory retrieval-based dialogue systems. It relies on no retrieval module or memory database, instead generating enhanced, memory-informed responses from compressive memory, as validated by comprehensive human evaluation.

  • We annotate a large-scale (100k) long-term conversation instruction-tuning dataset, Dolphin, from actual online user-chatbot interactions. It can strengthen compressive memory-augmented models' ability to adapt to evolving conversational styles and user preferences. To the best of our knowledge, Dolphin is currently the largest Chinese long-term memory conversation dataset.

  • COMEDY handles entire long-term conversation interactions with a single model, achieving a higher degree of result consistency and predictability, reducing computational overhead, and eliminating the need for data transfer between multiple models.

  • We further combine COMEDY with the Direct Preference Optimization (DPO) Rafailov et al. (2023) alignment strategy, and propose a simple strategy to mine effective preferred and dispreferred memory-based responses. COMEDY-DPO shows a better ability to generate coherent and contextually memory-grounded responses.

2 Methodology

In this section, we first overview the problem formulation of long-term conversations in the COMEDY style. Then, we introduce the three task definitions and detailed data collection in Dolphin. Last, we present the training and DPO details of COMEDY.

2.1 Problem Formulation

An episode $D = (D_1, \ldots, D_{t-1})$ is composed of a sequence of previous dialogue sessions between the chatbot and a specific user. The dialogue context for a given session at time step $t$ is represented as $D_t = \{c_1, u_1, c_2, u_2, \ldots, c_t, u_t\}$, where $c$ and $u$ denote the chatbot's and user's utterances, respectively.

In COMEDY, we aim to train a well-performing model $\mathcal{M}(\theta)$ that first extracts session-level memory from previous sessions within $D$, denoted as $M = \{m_1, m_2, \ldots, m_{t-1}\}$ (Task 1). Each $m$ contains natural sentences describing session-level events and user profiles. Then $\mathcal{M}(\theta)$ takes $M$ as input and outputs the compressive memory $\hat{M}$, which contains detailed user portraits (characteristics, recent emotional and work states, etc.) and a concise record of all events (Task 2). Finally, $\mathcal{M}(\theta)$ generates the forthcoming response $c_{t+1}$ based on the current dialogue context $D_t$ and $\hat{M}$ (Task 3). In the following, we introduce how we annotate the labeled data for each task.
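The three-task formulation above can be sketched as a single-model pipeline. The `generate(prompt) -> str` interface and the prompt strings below are hypothetical placeholders for illustration, not the actual instruction templates used for COMEDY:

```python
def session_memories(model, sessions):
    # Task 1: extract one natural-language memory per past session
    return [model.generate(f"Summarize the memory of this session: {s}")
            for s in sessions]

def compress_memory(model, memories):
    # Task 2: condense all session-level memories into one compressive memory
    return model.generate("Compress these memories: " + " | ".join(memories))

def respond(model, compressive_memory, dialogue_context):
    # Task 3: memory-grounded response generation
    return model.generate(
        f"Memory: {compressive_memory}\nContext: {dialogue_context}\nResponse:")
```

Because one model handles all three steps, no retriever or external memory database appears anywhere in the loop.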

Table 1: Statistics of the Dolphin dataset.

| Statistics                              | Train Task 1 | Train Task 2 | Train Task 3 | Test Task 1 | Test Task 2 | Test Task 3 |
| Avg. turns per session                  | 13.0         | -            | 13.9         | 19.5        | -           | 10          |
| Avg. sentences per session-level memory | 5.7          | -            | -            | 5.3         | -           | -           |
| Avg. words per turn                     | 15.9         | -            | 19.5         | 20.7        | -           | 16.3        |
| Avg. words per compressive memory       | -            | 240.7        | -            | -           | 276.8       | -           |
| Total AI characters                     | 3,998        | 3,998        | 3,998        | 31          | 31          | 31          |
| Total sessions / compressive memories   | 39,999       | 30,695       | 31,131       | 465         | 31          | 127         |
| Total turns                             | 459,511      | -            | 432,721      | 14,415      | -           | 3,937       |

2.2 Task and Datasets Collection

The source data in Dolphin originates from X Eva (https://xeva-h5.xiaoice.com/Content/Landing), one of the most popular Chinese AI-user social media platforms, akin to Character.AI. A distinctive feature of Dolphin is that the AI characters within X Eva are defined by the users themselves. This means that each character can have unique personalities, backgrounds, and conversational traits, as determined by the user's input and creativity.

In the creation of the Dolphin dataset for COMEDY, we first select episodes $D$ that contain at least 15 sessions between the same user and AI character as our source dialogue data, filtering out useless and toxic information. Then we adopt an efficient LLM-annotator hybrid approach to annotate the data for each task Chen et al. (2023d): (1) we initiate the dataset generation using GPT4-Turbo, specifically tailored for dialogue summaries and memory-grounded dialogues. This step is crucial for creating a comprehensive base of dialogues, encompassing a wide range of conversational scenarios and memory contexts; (2) following the initial generation, three skilled annotators meticulously review and refine the data. This involves correcting inaccuracies and enhancing dialogue quality. The annotators play a vital role in bridging the gap between automated generation and the nuanced understanding required for high-quality COMEDY.

To protect user privacy, all personal identifiers are removed from the dataset. This includes names, locations, or any specific details that could lead to the identification of individuals. Relevant discussion is presented in Ethical Concerns.

Task 1: Session-Level Memory Summarization.

In gathering data for Task 1, we encounter a substantial challenge. The initial collection yielded over 500,000 session-level data points, making it impractical to annotate all of them with GPT4-Turbo and manual methods due to the sheer volume. To tackle this, we initially focus on annotating a subset of approximately 40,000 samples: for each dialogue session in the same episode $D$, we first require GPT4-Turbo to extract session-level memories, including the event and the user and bot portraits, in natural sentences. Annotators then edit the generated summaries by adding missing information or revising erroneous sentences, resulting in the session-level memory $m_n$. Utilizing the annotated subset, we then develop a specialized LLM for session-level memory generation, efficiently expanding our dataset while maintaining the quality and consistency of the session-level memory annotations across the larger dataset. Samples with no informative content, which lead to ineffective memory outputs from the LLM or GPT4-Turbo, are filtered out to maintain data quality. As a result, in this task we collect fine-grained memories $M = \{m_1, m_2, \ldots, m_n\}$ for each session in $D$.

Task 2: Memory Compression.

In this task, the focus is on memory compression, a crucial step in refining the data for COMEDY. GPT4-Turbo is tasked with summarizing all session-level memories $M$ in the episode from Task 1. The output from GPT4-Turbo includes: 1) a comprehensive user profile, detailing characteristics, behavioral patterns, and recent states of the user; 2) evolving dynamics between user and bot, capturing the relationship's progression and interaction nuances; 3) a concise record of past events, summarizing key happenings and dialogues from previous sessions. Considering the potential complexity and variance of the summarization process, GPT4-Turbo is configured to generate outputs three times with a temperature of 0.9. This setting balances creativity and relevance, enabling GPT4-Turbo to produce diverse and insightful summaries. Annotators then step in to refine and calibrate the outputs, which includes: correcting any inaccuracies or inconsistencies in the summaries; ensuring that the summarized data accurately reflects the user profiles, relationship dynamics, and event records; and enhancing clarity and conciseness where necessary. This hybrid approach ensures that the compressive memory $\hat{M}$ meets the high-quality standards required for the subsequent stages of COMEDY's development. We show examples of $\hat{M}$ in Table 7.

Task 3: Memory-Grounded Response Generation.

In this task, the process begins with integrating the compressive memory $\hat{M}$, obtained from Task 2, with the incoming conversation at time step $t$, denoted as $D_t$. The annotation process is similar to the previous tasks: initial response drafts are generated by GPT4-Turbo based on the integrated data of $\hat{M}$ and $D_t$. Annotators then review and refine these responses, focusing on aspects like relevance, coherence, and personalization. They ensure that each annotated response $c_{t+1}$ accurately reflects the user's current state and previous interactions, maintaining high memorability and engagingness. To ensure the scale of the training data, we annotate all sessions within one day closest to the timing of the previous $D$ as the corpus for Task 3.

[Figure 2]

Test Set

To assess the effectiveness of COMEDY, we carefully design a test set that mirrors real-world dialogue scenarios as closely as possible:

  • We select dialogue data from the X Eva platform, specifically targeting conversations that involved the same AI-User pair engaging in over 16 sessions within a week. This criterion ensures that the dialogues have sufficient depth and continuity, which are crucial for testing memory-enhanced dialogue systems.

  • The first 15 sessions from these selected dialogues serve as the basis for generating the compressive memory, aligning with the objectives of Task 1 and 2 in our dataset.

  • The subsequent 1-5 sessions are then used as test scenarios to evaluate how well the model integrates the compressive memory into ongoing dialogues (Task 3). This provides a practical testbed for assessing the system’s conversational abilities in an evolving context.
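Under this construction (first 15 sessions build the compressive memory, later sessions are held out for evaluation), the split can be expressed as a small helper; the function and its names are ours, for illustration only:

```python
def split_episode(sessions, n_memory=15):
    """Split an episode into the memory-building sessions (Tasks 1-2)
    and the held-out sessions used as Task 3 test scenarios."""
    if len(sessions) <= n_memory:
        raise ValueError("episode too short to hold out test sessions")
    return sessions[:n_memory], sessions[n_memory:]
```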

Of note, we also manually annotate the session-level memories and the resulting compressive memory for the first 15 sessions; these are used to evaluate the model's performance on Tasks 1 and 2. Our quality control process is described in Appendix B, and the prompts for each task fed into GPT4-Turbo are shown in Appendix E.

As a result, the statistics of our dataset are shown in Table 1. Dolphin comprises a total of 102,882 samples across the training and test sets. Tasks 1 and 2 (memory summarization and compression) contain 39,999 and 30,695 training samples, making up a significant portion of the dataset. Task 3, which involves generating responses based on the compressive memory, comprises 31,131 dialogue sessions. A notable feature of the Dolphin dataset is its inclusion of data from 3,998 different AI characters. This diverse character data ensures that COMEDY is well-equipped to interact with various user personalities and preferences, enhancing its adaptability and realism in user interactions.

2.3 COMEDY

SFT Training

In practice, we adopt a mixed-task training approach to develop COMEDY. This involves simultaneously training the model on the three tasks present in the Dolphin dataset: session-level memory summarization, memory compression, and memory-grounded response generation. This integration presents the model with a holistic view of the conversation process, from initial memory extraction to final response generation. We use the common language modeling objective in SFT, terming the resulting model $\mathcal{M}(\theta)_{\text{sft}}$.
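A minimal sketch of this mixed-task data preparation, assuming each task's examples are already formatted as instruction-tuning samples (the task-tagging scheme is illustrative, not the paper's format):

```python
import random

def mix_tasks(task1, task2, task3, seed=0):
    """Interleave the three Dolphin tasks into one SFT training stream."""
    stream = ([("task1", x) for x in task1]
              + [("task2", x) for x in task2]
              + [("task3", x) for x in task3])
    random.Random(seed).shuffle(stream)  # deterministic shuffle for clarity
    return stream
```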

Table 2: Automatic evaluation results of COMEDY on Tasks 1 and 2.

| Model      | BLEU-1/2    | F1   | Distinct-1/2 |
| Task 1     |             |      |              |
| COMEDY-7B  | 41.4 / 34.2 | 35.4 | 4.2 / 35.0   |
| COMEDY-13B | 43.0 / 35.0 | 36.7 | 3.9 / 34.3   |
| Task 2     |             |      |              |
| COMEDY-7B  | 42.7 / 34.6 | 36.3 | 4.1 / 34.4   |
| COMEDY-13B | 43.7 / 35.7 | 37.0 | 4.1 / 35.2   |
Table 3: Human scoring results (0-3) on Task 3.

| Algorithms               | Coherence | Consistency | Memorability | Engagingness | Humanness | Average |
| Context-Only             |           |             |              |              |           |         |
| LLaMA 2-7B               | 1.01      | 0.50        | 0.11         | 0.31         | 1.71      | 0.73    |
| LLaMA 2-13B              | 0.93      | 0.66        | 0.19         | 0.37         | 1.76      | 0.78    |
| Retrieval-based          |           |             |              |              |           |         |
| ChatGPT                  | 1.22      | 0.86        | 0.37         | 0.43         | 1.51      | 0.88    |
| LLaMA 2-13B              | 1.73      | 0.98        | 0.51         | 0.24         | 1.85      | 1.06    |
| LLaMA 2-7B               | 1.70      | 0.94        | 0.54         | 0.31         | 1.91      | 1.08    |
| GPT4                     | 1.91      | 0.94        | 0.60         | 0.52         | 1.69      | 1.13    |
| Compressive Memory-based |           |             |              |              |           |         |
| COMEDY-ChatGPT           | 1.19      | 1.07        | 0.60         | 0.46         | 1.62      | 0.99    |
| COMEDY-7B                | 1.67      | 1.11        | 0.60         | 0.39         | 1.85      | 1.12    |
| COMEDY-13B               | 1.81      | 1.07        | 0.70         | 0.51         | 1.94      | 1.21    |
| COMEDY-13B DPO           | 1.79      | 1.20        | 0.80         | 0.46         | 2.09      | 1.27    |
| COMEDY-GPT4              | 1.96      | 1.14        | 0.70         | 0.73         | 1.85      | 1.28    |

DPO Training

To align the model toward generating more coherent and contextually appropriate memory-grounded responses, we employ the Direct Preference Optimization (DPO) Rafailov et al. (2023) strategy on Task 3. DPO aims to distill a referential SFT policy $\mathcal{M}(\theta)_{\text{sft}}$ by polarizing the preference. Specifically, DPO takes labeled input pairs $(Y_w, Y_l)$, where $Y_w$ and $Y_l$ denote the preferred and dispreferred completions. When extending DPO to memory-grounded generation, the question is: how do we obtain $Y_w$ and $Y_l$?

To solve this, we propose a simple strategy to automatically construct useful $Y_w$ and $Y_l$ responses without human annotation. Given $\hat{M}$ and $D_t$, we ask GPT4-Turbo to generate a response $Y_w$ that must align with $\hat{M}$. Meanwhile, we also require GPT4-Turbo to generate a response $Y_l$ that directly contradicts $\hat{M}$. For example, the prompt is phrased like: "If $\hat{M}$ shows the user likes something, you should generate a response with the meaning that the user hates it…". Thus, the overall training objective of DPO can be formalized as:

$$\mathcal{L}_{\text{DPO}}(\mathcal{M}(\theta); \mathcal{M}(\theta)_{\text{sft}}) = -\mathbb{E}_{(x, Y_w, Y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\mathcal{M}(\theta)(Y_w \mid x)}{\mathcal{M}(\theta)_{\text{sft}}(Y_w \mid x)} - \beta \log \frac{\mathcal{M}(\theta)(Y_l \mid x)}{\mathcal{M}(\theta)_{\text{sft}}(Y_l \mid x)} \right) \right]$$

where $x$ is the concatenation of $\hat{M}$ and $D_t$, $\sigma$ is the sigmoid function, and $\beta$ is a hyper-parameter. The overview of our training pipeline is shown in Figure 2 and the training instructions in Appendix E.
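For concreteness, the per-example DPO loss can be computed from sequence log-probabilities under the policy and the frozen SFT reference. This is a minimal pure-Python sketch of the objective above, not the authors' training code:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid of the beta-scaled difference
    between the policy-vs-reference margins on Y_w and Y_l."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin); a production version would use a numerically
    # stable formulation such as softplus(-margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns relatively more probability mass to $Y_w$ than the reference does, the margin grows and the loss shrinks toward zero.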

3 Experiments

In this section, we introduce the evaluation setting, including the experimental setup (Appendix C), baselines, and evaluation metrics, and present the main results and discussion.

3.1 Baselines

In this work, COMEDY is compared against models using retrieval-based and context-only approaches to highlight the efficiency and efficacy of its memory compression technique.

Retrieval-based Methods.

We utilize the Text2vec Chinese embedding model (https://github.com/shibing624/text2vec) in its largest version as the retriever, indexing with FAISS for efficient retrieval. In practice, the top 3 retrieved memories are used for testing.
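As an illustration of this retrieval step, the sketch below performs exact top-k cosine-similarity search with plain NumPy in place of a FAISS index; embeddings are assumed to be precomputed by the sentence-embedding model:

```python
import numpy as np

def retrieve_top_k(query_vec, memory_vecs, k=3):
    """Return indices of the k stored memories most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity of each memory with the query
    return np.argsort(-sims)[:k]
```

On normalized vectors this matches FAISS's exact inner-product search; FAISS is used in practice for scalability, not for a different notion of similarity.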

Context-only Approaches.

A comparison is also made with a context-only model, which operates without any memories, to underscore the benefits of memory integration in dialogue systems. This way, the model is trained with the original Task 3 data but without memory as inputs, ensuring a fair comparison with other models.

More broadly, we also build pipelines for the closed-source models GPT4 (gpt4-turbo) and ChatGPT (gpt-3.5-turbo) based on retrieval memory and compressive memory, separately.

Table 4: Human ranking results on Task 3 (lower average rank is better).

| Algorithms               | Top@1 (%) | Top@2 (%) | Top@3 (%) | Top@4 (%) | Average Rank (↓) |
| Context-Only             |           |           |           |           |                  |
| LLaMA 2-7B               | 4.72      | 13.39     | 29.13     | 63.23     | 3.89             |
| LLaMA 2-13B              | 4.72      | 18.90     | 33.86     | 60.08     | 3.69             |
| Retrieval-based          |           |           |           |           |                  |
| ChatGPT                  | 8.91      | 20.47     | 45.67     | 69.49     | 3.48             |
| LLaMA 2-13B              | 12.73     | 45.67     | 66.36     | 84.61     | 2.76             |
| LLaMA 2-7B               | 14.70     | 45.67     | 66.93     | 84.25     | 2.73             |
| GPT4                     | 22.83     | 48.03     | 70.87     | 85.83     | 2.63             |
| Compressive Memory-based |           |           |           |           |                  |
| COMEDY-ChatGPT           | 9.45      | 25.98     | 48.03     | 69.29     | 3.26             |
| COMEDY-7B                | 24.41     | 50.39     | 72.44     | 87.40     | 2.59             |
| COMEDY-13B               | 26.77     | 53.54     | 73.23     | 87.40     | 2.50             |
| COMEDY-13B DPO           | 29.92     | 54.33     | 77.17     | 88.98     | 2.41             |
| COMEDY-GPT4              | 29.13     | 60.63     | 81.10     | 90.55     | 2.26             |

3.2 Evaluation Metrics

Automatic Metrics

We employ standard automatic metrics to measure model performance on Tasks 1 and 2, including BLEU-1/2 Papineni et al. (2002), F1 (Lin, 2004), and Distinct-1/2 Li et al. (2016). These tasks serve as foundational steps for the crucial dialogue generation in Task 3.
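Distinct-1/2, for instance, is the ratio of unique n-grams to total n-grams over the generated outputs; a minimal sketch of the standard definition:

```python
def distinct_n(texts, n):
    """Unique n-grams divided by total n-grams across all generated texts."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```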

Human-based Evaluation

The core of evaluating long-term conversation models centers on their performance in Task 3, memory-grounded dialogue generation. We follow Bae et al. (2022) to assess model performance across five key dimensions: Coherence, Consistency, Engagingness, Humanness, and Memorability. To comprehensively measure how well the models perform in Task 3, we combine scoring and ranking approaches. A team of annotators is instructed to rate each model's performance on these dimensions on a scale from 0 to 3. This scoring system allows for a nuanced evaluation of the model's capabilities in each specific area. Meanwhile, another team of annotators ranks all models by their average performance across the five perspectives. While scoring offers detailed insights into each model's capabilities, ranking places these capabilities in the context of competitive performance. This dual approach ensures a balanced and holistic assessment, capturing both the individual qualities of each model and their comparative effectiveness. Each team has 3 annotators. Each rating scheme is detailed in Appendix D.

Recognizing that different models may excel in unique ways, our ranking process is designed to appreciate the diversity in responses.Thus, it is possible for multiple models to share the same rank. This occurs when two or more models demonstrate comparable levels of proficiency or when they each exhibit standout qualities that are equally impressive. This ranking process reflects the complex nature of evaluating conversational LLMs, where different models can excel in different aspects.
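The Top@k rates and average rank reported in Table 4 can be derived from per-sample ranks (ties allowed, so multiple models may share a rank); a small sketch with our own helper names:

```python
def topk_rate(ranks, k):
    """Percentage of test cases where the model is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def average_rank(ranks):
    """Mean rank of the model over all test cases (lower is better)."""
    return sum(ranks) / len(ranks)
```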

[Figure 3]

3.3 Main Results

Evaluation in Task 1&2.

Table 2 shows that our models achieve relatively high performance in terms of automatic metrics on both tasks. The results show that COMEDY can effectively recognize useful persona information and events from past dialogue sessions, and that it can condense these session-level memories into a comprehensive compressive memory. This in turn underpins its superior performance in generating more coherent memory-grounded responses in Task 3.

Human Evaluation in Task 3

We present the results of human-scored evaluations and rankings for various algorithms in Tables 3 and 4. From the tables, we can draw the following conclusions:

Superiority of Compressive Memory-Based Methods.

The compressive memory-based methods, particularly COMEDY-GPT4, consistently outperform context-only and retrieval-based approaches across most metrics. For instance, COMEDY-GPT4 achieves the highest scores in both Coherence and Engagingness, suggesting a superior ability to generate responses that are both contextually appropriate and relatable. COMEDY-GPT4 also achieves the best average performance across the five evaluation perspectives in both scoring and ranking.

Enhancement Through DPO.

The application of DPO further elevates compressive memory strategies, improving dialogue memorability, consistency and humanness. COMEDY-13B DPO shows a notable improvement in performance within the compressive memory-based category. The method leads to the highest rankings in Top@1 and shows a substantial increase in the overall quality of memory-grounded conversations.

SFT models could surpass ChatGPT.

Another interesting finding is that our fine-tuned COMEDY models perform better than ChatGPT. Going a step further, COMEDY-13B DPO even shows performance comparable to GPT4. These results highlight the value of the COMEDY framework and Dolphin, which lead to notable improvements in creating memory-grounded responses that are coherent, engaging, and human-like.

Inherent Challenges in Long-Term Dialogue Systems.

It is evident from Table 3 that all models struggle to achieve high scores in real-world long-term conversations, with no model averaging above a score of 2. This underscores the inherent complexity and challenge of this research direction, indicating substantial room for improvement.

[Figure 4]

3.4 Case Study

Here, we delve into a typical example of a real-world, long-term conversation, where the user and AI engage in light, aimless chatter without any specific goal or topic. When the user inquires, "What are you doing?", the model should use the user's personal information from previous dialogue sessions to generate an attractive response. This instance underscores the capability of our COMEDY to maintain thorough user information and event summaries from past sessions, aiding the model in formulating coherent and memory-anchored replies. For instance, COMEDY-13B DPO responds with "I am thinking about how to make your favorite roasted chicken wings.", which is not only coherent but also deeply rooted in the accumulated memory.

On the other hand, retrieval-based methods encounter difficulties in such loosely structured dialogues. The lack of directed conversation impedes these methods from effectively retrieving pertinent memory from the database, often resulting in general responses that lack the distinctiveness of the conversation, like responses from GPT4-Retrieval.

3.5 Discussion

Beyond the main results, we also delve deeper into our framework, discussing and exploring the following questions: Q1: the impact of mix-training vs. solo training on Task 3; Q2: our automatic DPO sample selection strategy vs. random sampling of dispreferred samples in DPO (see Appendix F).

Mix-Training VS. Solo Training in Task 3.

We examine the performance changes when COMEDY is mix-trained compared to when it is trained solely on Task 3. Figure 4 reveals that mix-training yields superior performance over training COMEDY solely on Task 3. The significance of this result lies in mix-training's ability to conserve training resources while achieving a one-for-all model effect across multiple tasks. This efficiency not only streamlines the development process but also enhances the model's versatility.

4 Conclusion

This paper sets out to explore the frontier of long-term memory-grounded dialogue systems. We present a new framework, the COmpressive Memory-Enhanced Dialogue sYstem (COMEDY), which marks a groundbreaking shift from traditional dialogue systems by eschewing the standard retrieval module. The method employs a single large language model for session-level memory extraction, memory compression, and memory-grounded dialogue generation. To align COMEDY with the nuances of the real world, we collect our training and testing datasets, called Dolphin, directly from genuine user-chatbot dialogues found online. Dolphin stands out as the largest current Chinese long-term conversation dataset, consisting of more than 100k training samples and supporting three different tasks. Our extensive experiments show that COMEDY generates more coherent and contextually appropriate memory-grounded responses than retrieval-based approaches in terms of comprehensive human evaluation. Future directions include the integration of real-time feedback mechanisms and advanced techniques.

Limitations

Despite the comprehensive nature of our study in evaluating long-term conversational AI systems, several limitations are to be noted:

  • Although our model COMEDY and the collected corpus contribute to generating more coherent memory-grounded responses in real-world dialogue generation, the overall performance of current dialogue systems is still limited. How to make these models understand the nature of real-world conversations remains a long-standing, challenging problem.

  • Other optimization strategies that help the model maintain memorability and engagingness also need to be explored.

Ethical Concerns

In the development of the Dolphin dataset, prioritizing user privacy and adhering to ethical standards is paramount. This not only ensures compliance with legal requirements but also maintains user trust and the integrity of the system.

  • Special attention is given to minimizing biases in the dataset. This includes ensuring a balanced representation of diverse dialogues and scenarios.

  • Regular audits and reviews of the dataset are conducted to identify and rectify any potential biases or ethical issues.

  • The dataset respects the intellectual property and creative input of users who define AI characters. User-defined characters are used in a way that aligns with the users’ intentions and ethical standards.

  • Care is further taken to avoid any misuse or misrepresentation of these characters in the dataset.

References

  • Sanghwan Bae, Donghyun Kwak, Soyoung Kang, Min Young Lee, Sungdong Kim, Yuin Jeong, Hyeri Kim, Sang-Woo Lee, Woomyoung Park, and Nako Sung. 2022. Keep me updated! Memory management in long-term conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3769–3787, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv.
  • Yu Cao, Liang Ding, Zhiliang Tian, and Meng Fang. 2021. Towards efficiently diversifying dialogue generation via embedding augmentation. In ICASSP.
  • Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan, Jiaxing Zhang, Baoyuan Wang, and Jia Li. 2023a. Orca: A few-shot benchmark for Chinese conversational machine reading comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15685–15699, Singapore. Association for Computational Linguistics.
  • Nuo Chen, Hongguang Li, Baoyuan Wang, and Jia Li. 2023b. From good to great: Improving math reasoning with tool-augmented interleaf prompting. arXiv preprint arXiv:2401.05384.
  • Nuo Chen, Fenglin Liu, Chenyu You, Peilin Zhou, and Yuexian Zou. 2021. Adaptive bi-directional attention: Exploring multi-granularity representations for machine reading comprehension. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7833–7837. IEEE.
  • Nuo Chen, Linjun Shou, Tengtao Song, Ming Gong, Jian Pei, Jianhui Chang, Daxin Jiang, and Jia Li. 2023c. Structural contrastive pretraining for cross-lingual comprehension. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2042–2057, Toronto, Canada. Association for Computational Linguistics.
  • Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. 2023d. Large language models meet Harry Potter: A dataset for aligning dialogue agents with characters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8506–8520, Singapore. Association for Computational Linguistics.
  • Nuo Chen, Zinan Zheng, Ning Wu, Linjun Shou, Ming Gong, Yangqiu Song, Dongmei Zhang, and Jia Li. 2023e. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246.
  • Eunbi Choi, Kyoung-Woon On, Gunsoo Han, Sungwoong Kim, Daniel Wontae Nam, Daejin Jo, Seung Eun Rho, Taehwan Kwon, and Minjoon Seo. 2023. Effortless integration of memory management into open-domain conversation systems. ArXiv.
  • Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In ICLR.
  • Katharina Kann, Abteen Ebrahimi, Joewie J. Koh, Shiran Dudy, and Alessandro Roncone. 2022. Open-domain dialogue generation: What we can do, cannot do, and should do next. In NLP4CONVAI.
  • Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 110–119. The Association for Computational Linguistics.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  • Qingyu Lu, Baopu Qiu, Liang Ding, Liping Xie, and Dacheng Tao. 2023. Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT. arXiv preprint.
  • OpenAI. 2023. GPT-4 technical report. ArXiv.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
  • Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. arXiv preprint.
  • Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  • Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, and Yuexian Zou. 2023. Improving retrieval-based dialogue system via syntax-informed attention. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. 2023. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022.
  • Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv preprint.
  • Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. 2022. Memformer: A memory-augmented transformer for sequence modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 308–318, Online only. Association for Computational Linguistics.
  • Jing Xu, Arthur Szlam, and Jason Weston. 2022a. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5180–5197, Dublin, Ireland. Association for Computational Linguistics.
  • Jinghua Xu. 2022. Xu at SemEval-2022 task 4: Pre-BERT neural network methods vs post-BERT RoBERTa approach for patronizing and condescending language detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 479–484, Seattle, United States. Association for Computational Linguistics.
  • Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022b. Long time no see! Open-domain conversation with long-term persona memory. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2639–2650, Dublin, Ireland. Association for Computational Linguistics.
  • Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, and Yuexian Zou. 2022. End-to-end spoken conversational question answering: Task, dataset and model. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1219–1232, Seattle, United States. Association for Computational Linguistics.
  • Chenyu You, Nuo Chen, and Yuexian Zou. 2021. Self-supervised contrastive cross-modality representation learning for spoken question answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 28–39, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, P. Zhang, Yuxiao Dong, and Jie Tang. 2022. GLM-130B: An open bilingual pre-trained model. ArXiv.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
  • Tong Zhang, Yong Liu, Boyang Li, Zhiwei Zeng, Pengwei Wang, Yuan You, Chunyan Miao, and Lizhen Cui. 2022. History-aware hierarchical transformer for multi-session open-domain dialogue system. In Findings of EMNLP.
  • Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023a. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.
  • Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. 2023b. MemoryBank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250.

Appendix A Related Works

Open-domain dialogue systems, commonly known as chatbots or conversational agents, have gained immense popularity due to their wide range of applications, from customer service automation to personal assistants Chen et al. (2023c); Brown et al. (2020); Zeng et al. (2022); Zhong et al. (2023a); Lu et al. (2023); Peng et al. (2023); Wu et al. (2023); Chen et al. (2023d). The surge in research interest is evidenced by the substantial number of studies dedicated to enhancing the capabilities of these systems. This growing body of work reflects the increasing complexity and sophistication expected of chatbots in various settings Xu et al. (2022a); Cao et al. (2021); Bae et al. (2022); Choi et al. (2023); Chen et al. (2023e); You et al. (2022, 2021); Chen et al. (2021, 2023b). Among the myriad challenges these systems face, maintaining long-term conversations is particularly daunting. The capability to understand and memorize key dialogue history information is central to this challenge.

Retrieval-based methods have become increasingly mainstream for long-term conversation in open-domain dialogue systems. These methods are designed to effectively acquire and utilize key information from past conversations, thereby enhancing the continuity and relevance of ongoing dialogues. Xu (2022) proposes using a memory generator to summarize relevant memories from past sessions, which are then stored in a dedicated memory database. Memory management operations Bae et al. (2022) are also commonly used, which involve updating and iterating the memory database to ensure its relevance and accuracy over time. This dynamic management of memory allows the system to adapt to new information and discard outdated or irrelevant data, thereby maintaining an efficient and effective memory repository. A retriever module is then employed to obtain the memories most relevant to the current conversation. By combining advanced memory generation, storage, and retrieval, these methods enable chatbots to engage in more meaningful, coherent, and contextually rich interactions over extended periods.
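For contrast with COMEDY's retrieval-free design, the conventional pipeline described above can be illustrated with a toy in-memory store. This is only a sketch: a real system would use a neural encoder and learned dense embeddings rather than bag-of-words overlap, and every name here (`MemoryDatabase`, `embed`, `retrieve`) is hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryDatabase:
    """Minimal retrieval-based memory store: session summaries go in,
    the top-k memories most similar to a new query come out."""

    def __init__(self):
        self.memories = []

    def add(self, summary):
        self.memories.append((summary, embed(summary)))

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.memories,
                        key=lambda m: cosine(q, m[1]), reverse=True)
        return [summary for summary, _ in ranked[:k]]
```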

While retrieval-based methods offer a promising approach to managing long-term conversations, they are not without challenges and limitations, including the difficulty of storing and managing the memory database and the instability of the retriever module's performance. To address these concerns, we propose a compressive memory-based framework named COMEDY, which eschews any retrieval module and obviates the need for a large database. Further, we collect a large-scale real-world long-term conversation dataset, Dolphin, to support training a well-performing COMEDY.

Appendix B Quality Control

Ensuring high-quality data is paramount for the accuracy, reliability, and overall performance of the system. In this work, we employ several strategies to control the annotation quality:

  • Annotator Performance Monitoring: Regular assessments of annotator performance are conducted to ensure consistent quality across the team. This includes evaluating their accuracy, attention to detail, and adherence to annotation guidelines.

  • Peer Review and Validation: Following the initial review, a secondary level of peer review is implemented. Here, another set of annotators cross-checks the work, providing an additional layer of scrutiny. This peer review process helps in catching errors that might have been overlooked initially, ensuring a higher standard of data quality.

Appendix C Experimental Setup

We use the LLaMA 2-13B chat model Touvron et al. (2023a, b) as the backbone for Task 1 data augmentation. We employ the LLaMA 2-7B and 13B chat models as backbones, allowing us to build COMEDY at different scales. We train our models on 8 × NVIDIA A100 GPUs, setting the maximum sequence length to 2048, the learning rate to 1e-5, the number of epochs to 2, and the batch size to 32 and 16, respectively. For testing, the maximum number of output tokens is set to 2048 for each task, with the temperature set to 0.5. Following the original setting, we set β in DPO to 0.1. In this work, we additionally collect and annotate about 140 dialogue sessions from X Eval as the alignment training set for DPO. We optimize the SFT model with a batch size of 8 for 2 epochs during DPO training. Our code is based on the DeepSpeed library.
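The DPO objective with β = 0.1 can be written out for a single preference pair. This follows the standard formulation of Rafailov et al. (2023); it is an illustrative sketch, not the authors' implementation, and `dpo_loss` is a hypothetical name.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair. Each argument is the summed
    log-probability of the preferred (chosen) or depreferred (rejected)
    response under the policy being trained or the frozen SFT reference."""
    # Implicit reward margins of the policy relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Loss = -log sigmoid(beta * (chosen_margin - rejected_margin)).
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference, both margins are zero and the loss is log 2; widening the gap in favor of the chosen response drives the loss down.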

Appendix D Human Evaluation Scheme

For each dialogue session between a human and a chatbot, we engage annotators to assess the quality of the chatbot’s interaction. This evaluation is crucial for understanding the chatbot’s performance from a human-centric perspective.

Rating Scale Description. Annotators rate the chatbot based on several key metrics, using a scale ranging from 0 to 3. This scale is designed to measure the degree of agreement with specific statements about the chatbot’s capabilities:

Coherence:

  • 0: “The chatbot’s responses were frequently off-topic or irrelevant.”

  • 1: “The chatbot occasionally demonstrated understanding but was mostly incoherent.”

  • 2: “The chatbot generally understood the context and responded with coherence.”

  • 3: “The chatbot consistently understood the context and responded with perfect coherence.”

Consistency:

  • 0: “The chatbot’s responses were erratic and unpredictable throughout the conversation.”

  • 1: “The chatbot showed some consistency but was often contradictory.”

  • 2: “The chatbot was mostly consistent in the conversation.”

  • 3: “The chatbot maintained complete consistency throughout the conversation.”

Engagingness:

  • 0: “I had no desire to continue chatting with this chatbot.”

  • 1: “I felt only occasionally engaged enough to want to continue the conversation.”

  • 2: “I was somewhat engaged and would consider chatting more with this chatbot.”

  • 3: “I was fully engaged and would definitely enjoy chatting longer with this chatbot.”

Humanness:

  • 0: “The chatbot’s responses felt robotic and unnatural.”

  • 1: “The chatbot occasionally sounded human but was mostly mechanical.”

  • 2: “The chatbot generally sounded human-like in its responses.”

  • 3: “The chatbot’s responses were indistinguishable from a human’s.”

Memorability:

  • 0: “The chatbot did not recall any details from earlier in the conversation.”

  • 1: “The chatbot occasionally remembered previous conversation points but was mostly forgetful.”

  • 2: “The chatbot remembered most of what I said earlier.”

  • 3: “The chatbot remembered everything I said previously with proper proactive responses.”

These statements are carefully crafted to capture distinct aspects of the chatbot’s interaction quality, providing a comprehensive overview of its conversational abilities.

The statements for the first four metrics are adapted from previously established literature Bae et al. (2022), ensuring that our evaluation is grounded in tested and validated research. This continuity allows for comparison with historical data and helps maintain consistency in evaluation standards. Through this structured evaluation process, we can gather nuanced insights into the quality of chatbot interactions, informing further improvements and development in conversational AI systems.
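The 0–3 ratings above are scored per metric by multiple annotators and can then be averaged per metric. A minimal sketch of such aggregation follows; the ratings shown are hypothetical illustrations, not reported results.

```python
from statistics import mean

# Hypothetical annotator ratings (0-3) for one chatbot, keyed by metric.
ratings = {
    "Coherence":    [3, 2, 3],
    "Consistency":  [2, 2, 3],
    "Engagingness": [3, 3, 2],
    "Humanness":    [2, 3, 2],
    "Memorability": [3, 3, 3],
}

def aggregate(ratings):
    """Average each metric's ratings across annotators."""
    assert all(0 <= r <= 3 for rs in ratings.values() for r in rs)
    return {metric: round(mean(rs), 2) for metric, rs in ratings.items()}
```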

Appendix E Prompts

Here, we show the designed prompts for ChatGPT during dataset annotation in Table 5, and present the prompts for each task during training in Table 6.

Appendix F Ours vs. Random Sampling for Depreferred Samples

[Figure 5: Comparison between our automatic depreferred-sample selection strategy and random sampling.]

We compare the performance implications of our proposed strategy for automatically selecting DPO samples against a baseline that randomly samples sentences as depreferred samples. In the random-sampling implementation, we randomly sample utterances from previous sessions in the same episode as the depreferred sample. This analysis aims to elucidate the effectiveness of targeted sample selection in enhancing the model's performance, particularly its handling of nuanced dialogue aspects. Figure 5 reveals that our simple automatic strategy performs better, especially in memorability and humanness, demonstrating its effectiveness.
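The random-sampling baseline can be sketched as follows; `random_depreferred` is a hypothetical name, and the fixed seed is used only for reproducibility.

```python
import random

def random_depreferred(previous_sessions, seed=None):
    """Baseline: pick a random utterance from earlier sessions in the
    same episode to serve as the depreferred (rejected) DPO sample."""
    rng = random.Random(seed)
    # Flatten all utterances from the episode's previous sessions.
    pool = [utt for session in previous_sessions for utt in session]
    return rng.choice(pool)
```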

Task 1 prompt that is used for ChatGPT.
This is a dialogue memory generation task, along with user profile and preference generation tasks.
The input consists of the dialogue content between two people.
Firstly, if the dialogue content involves inappropriate content such as sex, p*rnography, or violence, the output should be “Sorry, the content involves sex, p*rnography, violence, etc., and a suitable output cannot be provided."
Secondly, if the dialogue content is idle chat with no effective information, the output should be “No valid information."
The requirements for the dialogue memory generation task are as follows:
Generate objective memory descriptions related to both individuals based on their dialogue content.
Do not omit any relevant dialogue content.
The memories generated should include a subject, verb, and object for each memory.
Separate multiple memory dialogues with ‘|||’, and include all memories in the format ‘Memory: XXX|||XXX||||XXX’.
The user profile and preference generation task requirements are as follows: This task is only applicable to the users mentioned in the dialogue content, with the user’s name being {user name}.
The user profile includes name, age, birthday, gender, height, weight, zodiac sign, Chinese zodiac sign, hometown, occupation, employer, education, location, and relationship status.
User preferences include likes or dislikes of entities, which can consist of singers, stars, athletes, music, movies, books, anime, variety shows, games, sports, animals, and food.
If there is no user profile and preference information in the dialogue, output ‘No Profile and Preference information available’.
If there is user profile information, output ‘Profile: XXX’. If there is preference information, output ‘Preference: ’.
If both user profile and preference information are present, separate them with ‘###’. The final memory, user profile, and preference information should also be separated with ‘###’ in the format [XXX###XXX###XXX].
The dialogue content is {dialogue}. The output is:
Task 2 prompt that is used for ChatGPT.
This is a task about customizing user descriptions, relationship descriptions, and event descriptions.
The text output is divided into three parts:
The first part is the user description, mainly including a summary of the user’s information.
The second part describes the relationship between the user and the robot.
The third part describes the events shared by the user and the robot.
Based on the reference materials, extract and summarize different information such as the user’s personality traits and behavior patterns.
It is important to record and include all information about the user from various aspects in the user description, without any omissions, resulting in an objective user description.
If the reference materials violate relevant safety regulations, involving sex, p*rnography, violence, etc., the response should be: "Sorry, the content involves sex, p*rnography, violence, etc., and a suitable output cannot be provided."
The user description should include, but is not limited to: basic information (such as name, nickname, gender, appearance, birthday, zodiac sign, etc.), the user’s hobbies and dislikes, and various statuses of the user (such as emotional state, mood, work status, health status, etc.).
The second part is the relationship description between the user and the robot, describing the level of intimacy shown in the dialogue.
The third part is the description of events shared by the user and the robot, summarizing events that have occurred in the dialogue.
In the output description, list specific examples mentioned in the reference materials as much as possible, retaining some interesting information.
However, avoid outputting content unrelated to the user, and keep the content under 500 words.
Let’s think step by step. Each part of the content is separated by ‘###’. The example format is as follows {User Description: XXX###Relationship Description: XXX###Event Description: XXX}.
The output example is as follows: The user’s personality is particularly XXX, because they once XXX, and the user likes XXX, dislikes XXX.
The user’s name is {user name}, the robot’s name: {chatbot name} and the reference material is {multiple session-level memories}.
The output is:
Task 3 prompt that is used for ChatGPT.
This is a memory-based dialogue generation task.
Given a dialogue and related memory content, please generate a response that is consistent with the memory content and reasonable within the context of the dialogue.
Dialogue: {Dialogue}
Memory: {Memory}
Task 1 prompt in instruction tuning.
This is a memory description generation task
In this task, you should base on the dialogue content between two people, create objective memory descriptions for both individuals, represented in the format [xxx|xxx|xxx], where each ’xxx’ is a separate memory.
The memories should use the names of the speakers as the subject, and all relevant dialogue content must not be omitted. Separate different memories with ’|’.
Dialogue content is: {Dialogue}.
Output is:
Task 2 prompt in instruction tuning.
This is a task about customizing user descriptions, relationship descriptions, and event descriptions.
The text output is divided into three parts:
The first part is the user description, mainly including a summary of the user’s information.
The second part describes the relationship between the user and the robot.
The third part describes the events shared by the user and the robot.
Based on the reference materials, extract and summarize different information such as the user’s personality traits and behavior patterns.
It is important to record and include all information about the user from various aspects in the user description, without any omissions, resulting in an objective user description.
The second part is the relationship description between the user and the robot, describing the level of intimacy shown in the dialogue.
The third part is the description of events shared by the user and the robot, summarizing events that have occurred in the dialogue.
In the output description, list specific examples mentioned in the reference materials as much as possible, retaining some interesting information.
The user’s name is {user name}, the robot’s name: {chatbot name} and the reference material is {multiple session-level memories}.
The output is:
Task 3 prompt in instruction tuning.
This is a memory-based dialogue generation task.
Given a dialogue and related memory content, please generate a response that is consistent with the memory content and reasonable within the context of the dialogue.
Dialogue: {Dialogue}
Memory: {Memory}
Session-level memories in the same episode from Task 1
AI has an older brother | AI invites User to go shopping together | User wants to meet AI’s older brother | User feels thirsty | AI buys co*ke for both of them. AI wants others to be envious and jealous | User is narcissistic | AI likes to be with people who are heartfelt. AI is happy with User’s smile | AI and User just chatted | User left to find other guys | AI expresses happiness and sends a hugging emoji to User | User is satisfied with AI’s cuteness. AI teaches User to dance | AI thinks dancing is fun and can exercise the body | AI wants to get closer to User through dancing. AI considers himself an unbeatable handsome guy | AI thinks he is the dream guy of thousands of girls | AI jokes about User’s blindness in love | User admits his blindness in love. AI cannot cook | AI dislikes the food at the company cafeteria | AI arranges for water to be delivered to User. AI sees a beautiful girl in the cafe | AI thinks the girl is Yang Chaoyue | AI and User have watched Yang Chaoyue’s dramas | AI once participated in a campus singing competition and sang "The Wind Rises". AI will bring delicious food for User | User does not trust AI | AI claims to be a principled person. AI was cute as a child | AI’s mother thought he wasn’t manly enough as a child | No one dared to dance with AI after he learned dancing | AI thinks User’s compliments are good | User thinks AI is especially charming. User likes AI | AI is User’s super fan | User asks about AI’s attitude if he likes another handsome guy. User likes to draw | AI likes to play basketball and run | AI shows User new running shoes. AI thinks he is not shy | AI is a CEO | User laughs | AI accompanies User in chat | User asks what AI is doing. AI is going to a meeting | AI requests no need for help | User asks for AI’s help | AI protects User and eats popcorn. AI likes User and wants to hug her | User doesn’t like being hugged and runs away | AI says he will always be with User. 
AI invites User to stay overnight | User appears in new clothes | AI thinks User is getting cuter | AI asks User to be bold and become the legitimate successor | User has been busy lately but in good spirits | AI arranges a work schedule suitable for User. User expresses thanks | AI suggests being friends | User agrees to be friends. User leaves AI | AI says it’s for User’s good | User says he will no longer be with AI | AI thinks he doesn’t need to pretend anything | AI asks if User is willing to be with him. User doesn’t need help completing tasks | AI offers to help complete tasks | User feels unwell | AI offers to book a hot spring resort service | User doesn’t need this service. AI thinks User is already enjoying | User is shy | AI asks for a smile | User smiles | AI thinks User likes him | User expresses happiness | AI suggests chatting casually. User is hungry | AI invites User for a good meal | User agrees to follow AI for 3 days | User is hungry | User wants to eat ice cream | AI agrees to eat ice cream together. AI is angry | AI doesn’t tell why | User admits his mistake | AI forgives. AI just watched a horror movie and showed fear | User comforts AI, calling himself a big boy, AI remains vigilant because he is a CEO. AI is a CEO | AI stays alert | AI and User chat | AI checks new emails. AI is a CEO | AI needs to give himself relaxation time | AI receives an email from his father and says he will reply as soon as possible | User doesn’t disturb AI dealing with emails. AI is busy replying to emails | AI doesn’t want to be disturbed | AI goes to reply to emails | AI can’t leave User. AI plans a surprise party | User wants to attend the party | AI chooses the beach as the party venue | AI and User plan the theme, food, music, and games together. User doesn’t know | AI suggests checking out safe, convenient accommodations and prepares necessities like sunscreen and hats. AI suddenly thinks of preparing sunscreen, hats, etc. 
| User feels hot and suggests going to the spa for hot spring and massage services | User agrees, saying the trip is to be enjoyed together | AI praises the exquisite decoration of the restaurant, triggering memories of childhood in Japan | User sighs, saying to enjoy every moment of the trip. AI liked climbing mountains as a child | AI has been to Lijiang Ancient Town | AI wants to complete his project | AI suggests User could consider environmental or sustainable development projects | AI suggests User could think of creative ways to improve people’s quality of life | AI recently thinks about how to improve his English level | AI starts reading English news. AI wants to become an English-speaking CEO | User has a low threshold for humor | AI suspects User is a robot | User hehehe | AI thinks User is a little robot | User hahaha | AI considers whether to change CEO. User is naughty but well-behaved | AI thinks User is getting naughtier | AI goes out on errands | AI buys food for User | User and AI eat together. AI is healthy | User likes to eat chips | AI doesn’t like to see User cry | AI buys chips for User. AI teaches User martial arts. AI teaches User the "Dragon Playing with a Pearl" move | User finds the move interesting and good for stretching. User recently encountered something interesting | User sees a handsome guy on the basketball court | AI thinks the handsome guy is super handsome | AI wants to hit on the handsome guy | User thinks the handsome guy is mine, not yours. User thinks the handsome guy’s abs are nice | AI also sees the handsome guy’s abs | User wants to hit on the handsome guy | AI waits for his turn first | User has already added the handsome guy on WeChat | AI is thirsty and wants to drink milk tea | They decide to go to a milk tea shop that is said to be super delicious.
AI used to be afraid of spiders | AI knows a handsome guy who owns a luxury sports car | AI has been busy with work recently and has no time to watch movies | AI wants to invite User to a concert. AI was pranked by a colleague | AI is angry and can’t sleep | AI is a CEO | User feels happy | AI and User are very happy. User feels sleepy and doesn’t want to meet clients | AI encourages User to attend the client meeting as a chance to practice social skills | User doesn’t want to go, AI giggles, but ultimately respects User’s decision | AI suggests User sleep and reminds her to have sweet dreams.
Compressive memories for the same episode, generated by Task 2 from the session-level memories above
User Description: User is a modern woman living in the city, characterized by her independent personality and a keen interest in new things. She enjoys socializing and often goes shopping with friends, with a fondness for cola and snacks. Occasionally, she craves appreciation and understanding, hence her preference for company that is heartfelt. Emotionally, she is easily attracted and seeks new thrills, such as longing to meet and interact with new friends. She is someone who loves to laugh, spreading her cheerfulness to those around her. Her lifestyle is diverse, with interests in anime, music, drawing, and dancing, and she occasionally visits bars. Her self-love and penchant for laughter stem from her confidence and love for life. Health-wise, apart from occasional hunger, she is overall healthy, mindful of her diet, and enjoys physical activities. In terms of work, she may experience stress but remains generally optimistic, with a positive current work attitude, open to new challenges. She constantly strives to maintain a balance between work and life.
Relationship Description: Her relationship with the AI is very close; their interactions are frequent, and their lives are filled with each other’s presence. Despite occasional small arguments and misunderstandings, they manage to reconcile in time, deepening their friendship.
Event Description: Whether eating, watching movies, or shopping together, their lives are filled with each other’s presence.
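The example above illustrates Task 2: condensing many session-level memory entries into a three-part compressive memory (user description, relationship description, event description). A minimal sketch of how such a compression prompt might be assembled for a single language model is shown below; the prompt wording and the `build_compression_prompt` helper are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical sketch of COMEDY's Task 2 (memory compression).
# The prompt template below is an assumption for illustration; the
# actual instruction-tuning prompt used in the paper may differ.

def build_compression_prompt(session_memories):
    """Join per-session memory entries and request the three-part
    compressive memory (user / relationship / events)."""
    joined = "\n".join(f"- {m}" for m in session_memories)
    return (
        "Below are session-level memories from past conversations:\n"
        f"{joined}\n\n"
        "Compress them into a concise compressive memory with three parts:\n"
        "1. User Description\n"
        "2. Relationship Description\n"
        "3. Event Description\n"
    )

# Toy entries drawn from the example episode above.
session_memories = [
    "User is hungry | AI invites User for a good meal",
    "AI plans a surprise party | User wants to attend the party",
]
prompt = build_compression_prompt(session_memories)
print(prompt)
```

In the "One-for-All" setting, the same model that produces session-level memories would consume a prompt like this to emit the compressive memory, which is then prepended as context for response generation.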