通过言语反思进行强化

原文：https://medium.com/@JacekWo/reinforcement-via-verbal-reflection-92d3ec18e2bf

通过言语反思进行强化（称为反思，请参阅：下面的文献）是人工智能和机器学习领域的一种创新方法，特别是在强化学习领域。这种方法利用语言模型的自然语言处理功能来促进更直观和更像人类的学习过程。该方法的核心是围绕四个关键组件之间的动态相互作用：行动者、评估者、自我反思和记忆。这些组件在循环过程中协同工作，使系统能够从其行动中学习并随着时间的推移改进其决策。

Actor 是系统内的决策实体。它观察环境的当前状态，其中不仅包括手头任务的直接背景（例如，需要验证的声明），还包括其记忆中的相关历史背景。基于这种综合视图，Actor 会采取行动。例如，在事实核查系统的背景下，该行动可能会对索赔的有效性产生初步裁决。演员的决定受到其过去的经验和随着时间的推移积累的知识的影响，这些知识存储在其记忆中。

评估者的作用是评估行为者采取的行动。它以奖励的形式提供反馈，奖励是衡量行动成功或正确性的指标。评估者的判断可以基于预定义的标准、已知的事实，甚至是外部信息来源。奖励是一个关键的信号，引导行为者了解其行为的后果，并了解哪些行为会导致成功，哪些行为不会。

自我反思是一个元认知组件，系统在其中考虑其行动、从评估者那里收到的反馈以及其决策过程的更广泛背景。这个内省过程是用自然语言进行的，使系统能够阐明见解，确定需要改进的领域，并就如何完善未来的行动制定战略。这里产生的反思信息丰富，提供了一个叙述，概括了系统对其性能的理解和自我改进计划。

在这个框架中，记忆有两个主要目的：它为当前决策提供背景，并作为未来学习的经验库。记忆分为短期和长期部分。短期记忆保留了最近的经验和即时反馈，为当前的决策提供了详细的背景。另一方面，长期记忆存储了随着时间的推移获得的提炼知识和见解，包括系统产生的有价值的反思。这种分层的记忆结构使行动者不仅能够根据直接的背景做出决定，而且还可以根据其积累的经验和学习的丰富内容做出决定。

通过言语反思进行强化的过程是周期性和动态的。Actor 根据当前状态及其记忆采取行动，评估者评估该行动并提供反馈，然后系统进行自我反思，思考反馈和自己的行动，以获得可操作的见解。然后，这些见解被集成到内存中，不断丰富系统的知识库和决策框架。这个过程促进了一个复杂的学习循环，系统不仅可以从直接反馈中学习，还可以更深入地理解其行动和决策，从而实现持续改进和适应。

反思框架，来源：作者，见文献

假设可以访问事实来源（Wikipedia API），这些是使用 Reflexion 代理的可能场景：

事实核查代理：

场景：代理收到声明或声明，需要使用来自维基百科的信息验证其准确性。反思：代理人可以反思来源的可靠性或用于验证信息的方法，随着时间的推移改进其事实核查策略。

互动学习和辅导：

场景：代理充当特定主题的导师，使用维基百科提供详细的解释或回答问题。反思：在每次互动之后，智能体会反思其解释的清晰度和学生的反应（如果有的话），以改进其教学方法。

内容总结与分析：

场景：代理收到冗长的文章或文档，并且必须提供简明扼要的摘要或主题分析。反思：代理将摘要或分析与源材料或反馈进行比较，学习如何更有效地捕捉要点。

历史事件重建：

场景：代理被要求根据维基百科中的多个条目重建历史事件的时间线或叙述。反思：智能体反思其构建的叙事的连贯性，并调整其方法以确保准确性和逻辑流畅性。

政策辩论与讨论：

场景：代理参与关于政策问题的辩论，使用来自维基百科的信息来支持其论点。反思：辩论结束后，代理人反思其论点的优缺点以及提出的反驳，旨在提高其推理和修辞技巧。

创意写作和讲故事：

场景：代理制作故事或创意文本，使用维基百科来获得事实准确性或灵感。反思：代理反思角色发展、情节连贯性和读者参与度等叙事元素，以提高其讲故事的能力。

技术支持和故障排除：

场景：代理提供技术支持，使用维基百科了解与特定产品或技术相关的常见问题和解决方案。反思：代理审查其故障排除建议的有效性，并改进其诊断和解决问题的方法。

文化探索与旅游规划：

场景：代理协助规划旅行行程或探索有关不同地点的文化信息，使用维基百科获取详细准确的信息。反思：代理评估其提供的信息的相关性和有用性，旨在增强用户的旅行计划或文化探索体验。

我将尝试基于Reflexion框架的简单重新实现来实现Fact-Checking Agent。有些部分很棘手，取决于用例，尤其是对参与者动作的评估。

GPTBot 充当语言模型的接口，处理通信并确保根据当前上下文和历史记录生成响应。

class GPTBot: def __init__(self, model_name, api_key): self.model_name = model_name openai.api_key = openai.api_key self.history = [] # Store the full conversation history def __call__(self, message): # Clear history for each new query to start fresh self.history = [] self.history.append({"role": "user", "content": message}) return self.execute() def execute(self): try: completion = openai.chat.completions.create( model=self.model_name, messages=self.history ) response = completion.choices[0].message.content self.history.append({"role": "assistant", "content": response}) return response except Exception as e: logging.error(f"Error during API call: {e}") return "Sorry, I encountered an error." def reset(self): """Reset the conversation history.""" self.history = []

代理充当 Reflexion 框架内的决策实体。它通过包括来自短期记忆的近期经历和来自长期记忆的见解来构建每个声明的上下文。

def actor(claim, bot, short_term_memory, long_term_memory): message = f"Claim: {claim}\n" message += "Recent experiences and verdicts:\n" for experience in short_term_memory: message += f"- Claim: {experience['claim']}, Verdict: {experience['verdict']}, Info: {experience['info']}\n" message += "Past reflections:\n" for reflection in long_term_memory: message += f"- {reflection}\n" message += "Analyze the claim's validity and gather supporting or contradicting information from Wikipedia." response = bot(message) preliminary_verdict, gathered_information = interpret_response(response) return preliminary_verdict, gathered_information

评估者评估参与者采取的行动。它负责评估演员的判决和决定的正确性和相关性。

def evaluator(claim, preliminary_verdict): reward = compute_reward(claim, preliminary_verdict) return rewarddef compute_reward(claim, preliminary_verdict): expected_truths = { "The Golden Gate is located in London.": False, "Edward Witten is the first physicist who received Fields Medal.": True, "The Great Wall of China is visible from the Moon.": False, } expected_truth = expected_truths.get(claim) # Normalize the verdict for case-insensitive comparison verdict_lower = preliminary_verdict.lower() # Define phrases that indicate a claim is considered true or false indicators_of_truth = ['true', 'valid', 'correct', 'accurate'] indicators_of_falsehood = ['false', 'not true', 'not valid', 'debunked', 'myth', 'not accurate'] # Check the alignment of the verdict with the expected truth if expected_truth is True and any(indicator in verdict_lower for indicator in indicators_of_truth): return 1.0 elif expected_truth is False and any(indicator in verdict_lower for indicator in indicators_of_falsehood): return 1.0 return 0.0

自我反省根据索赔、收集的信息、初步判决和收到的奖励产生反思。这种反思旨在指导未来行动的改进。

def self_reflection(claim, gathered_information, preliminary_verdict, reward): reflection = f"Claim: {claim}, Verdict: {preliminary_verdict}, Reward: {reward}. " reflection += "Consider improving information gathering and analysis for better accuracy." return reflection

短期记忆更新了最新的索赔，收集了信息，初步判决，并获得了奖励。如果短期内存超过其最大大小，则删除最旧的条目。

def update_short_term_memory(short_term_memory, claim, gathered_information, preliminary_verdict, reward): short_term_memory.append({ 'claim': claim, 'info': gathered_information, 'verdict': preliminary_verdict, 'reward': reward }) if len(short_term_memory) > MAX_SHORT_TERM_MEMORY_SIZE: short_term_memory.pop(0)

长期记忆会随着被认为有价值的反思而更新。

def update_long_term_memory(long_term_memory, reflection): if is_valuable(reflection): long_term_memory.append(reflection)

从机器人解析响应，以将 preliminary_verdict 和 gathered_information

def interpret_response(response): split_response = response.split(". ") preliminary_verdict = split_response[0].replace("Verdict: ", "") gathered_information = split_response[1].replace("Information: ", "") return preliminary_verdict, gathered_information

检查反射是否包含指示有价值见解的特定关键字的简单实现。

def is_valuable(reflection): # Define keywords that indicate valuable insights valuable_keywords = ['improved', 'new insight', 'significant', 'actionable step', 'corrected'] # Check if the reflection contains any of the valuable keywords for keyword in valuable_keywords: if keyword in reflection.lower(): return True # If none of the valuable keywords are found, consider the reflection less valuable return False

要处理的确认清单。

def get_claims(): claims = [ "The Golden Gate is located in London.", "Edward Witten is the first physicist who received Fields Medal.", "The Great Wall of China is visible from the Moon." ] return claims

主循环编排整个过程。对于每个声明，它都会重置机器人，通过 Actor 生成 preliminary_verdict and gathered_information ，评估判决以计算奖励，更新记忆并生成反射。最后，它输出每个声明的结果。

def main(): claims = get_claims() short_term_memory = [] long_term_memory = [] bot = GPTBot(model_name="gpt-3.5-turbo-1106", api_key="your-openai-api-key") for claim in claims: bot.reset() preliminary_verdict, gathered_information = actor(claim, bot, short_term_memory, long_term_memory) #reward = evaluator(gathered_information, preliminary_verdict) reward = evaluator(claim, preliminary_verdict) #print(f"Reward: {reward}") update_short_term_memory(short_term_memory, claim, gathered_information, preliminary_verdict, reward) reflection = self_reflection(claim, gathered_information, preliminary_verdict, reward) update_long_term_memory(long_term_memory, reflection) print(f"Claim: {claim}, Verdict: {preliminary_verdict}, Reward: {reward}, Reflection: {reflection}")if __name__ == "__main__": main()

结果如下：

Claim 1: The Golden Gate is located in London., Verdict: The claim that the Golden Gate is located in London is not valid, Reward: 1.0.Claim 2: Edward Witten is the first physicist who received Fields Medal., Verdict: The claim that Edward Witten is the first physicist who received the Fields Medal is not valid, Reward: 1.0.Claim 3: The Great Wall of China is visible from the Moon., Verdict: The claim that the Great Wall of China is visible from the Moon has been widely debunked, Reward: 1.0.

根据我的知识，权利要求 1、3 得到了正确的解决。关于权利要求2，解决方案是错误的。爱德华·威滕教授是第一位获得菲尔兹奖的物理学家。以下是 chatGPT4 对我问题的回答：

Edward Witten is the first physicist who received Fields Medal. True or false?ChatGPTFalse. Edward Witten is not the first physicist to receive the Fields Medal. He did receive the Fields Medal, which is often regarded as the highest honor a mathematician can receive, but he was not the first. The Fields Medal has been awarded to many mathematicians since it was first established in 1936. Edward Witten received it in 1990 for his contributions to mathematical physics. The distinction of being the first physicist to receive the Fields Medal likely adds to the common confusion, as Witten's work beautifully bridges the gap between mathematics and physics.

结论

Reflexion 框架虽然具有创新性和前途，尤其是在将自然语言处理与强化学习相结合方面，但确实存在某些局限性和潜在的弱点：

对语言质量模型的依赖性：Reflexion 框架的有效性在很大程度上依赖于底层语言模型。语言模型在理解、推理或生成连贯且上下文准确的响应方面的任何限制都将直接影响 Reflexion 代理的性能。评估反射的困难：评估自我反射组件产生的反射的质量和有用性可能具有挑战性。它需要复杂的机制来评估反思是否提供了有意义的见解或可操作的改进建议。处理歧义和细微差别：自然语言本质上是模棱两可的，并且依赖于上下文。Reflexion 框架必须能够理解和处理这些细微差别，尤其是在解释声明、生成判决和产生反思时。跨领域的泛化：虽然 Reflexion 框架在特定领域可能很强大，但在广泛的任务或主题领域中泛化其功能可能很困难。定制系统以在一个领域表现良好并不一定能转化为另一个领域的成功。反馈回路稳定性：系统的学习受到涉及 Actor、Evaluator 和 Self-Reflection 组件的反馈回路的影响。确保这种反馈循环是稳定的，并导致有意义的学习，而不是强化不正确的行为或偏见，这一点至关重要。

解决这些弱点需要精心设计、严格的测试和不断完善系统。自然语言处理、强化学习和领域适应方面的先进技术，以及有效的数据管理和模型评估策略，对于实现 Reflexion 框架的全部潜力至关重要。

文献

Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu YaoReflexion： Language Agents with Verbal Reinforcement Learning — Noah Shinn， Federico Cassano， Edward Berman， Ashwin Gopinath， Karthik Narasimhan， Shunyu Yao

玩酷网

通过言语反思进行强化

热门分类