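The Actor is the decision-making entity in the system. It observes the current state of the environment, which includes not only the immediate context of the task at hand (for example, the claim to be verified) but also relevant historical context from its memory. Based on this combined view, the Actor takes an action. In a fact-checking system, for example, the action might be a preliminary verdict on the validity of the claim. The Actor's decisions are shaped by its past experience and by the knowledge it has accumulated over time, both of which are stored in its memory.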




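Reinforcement through verbal reflection is a cyclic, dynamic process. The Actor takes an action based on the current state and its memory, the Evaluator assesses that action and provides feedback, and the system then reflects on the feedback and on its own action to extract actionable insights. These insights are written back into memory, where they enrich the knowledge and decision context available to the next trial. The result is a learning loop in which the system learns not only from the direct feedback but also from its own analysis of why an action succeeded or failed.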
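The cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the Reflexion paper: `reflexion_loop`, the three callables, `max_trials`, and `success_threshold` are hypothetical names introduced here, and the toy components exist only to make the loop runnable without a language model.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    task: str,
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], float],
    reflect: Callable[[str, str, float], str],
    max_trials: int = 3,
    success_threshold: float = 1.0,
) -> Tuple[str, List[str]]:
    """Run act -> evaluate -> reflect until the reward is high enough."""
    memory: List[str] = []  # verbal reflections carried across trials
    action = ""
    for _ in range(max_trials):
        action = act(task, memory)       # Actor sees the task plus its memory
        reward = evaluate(task, action)  # Evaluator scores the action
        if reward >= success_threshold:
            break
        # Self-reflection turns the feedback into text and stores it.
        memory.append(reflect(task, action, reward))
    return action, memory


# Toy components (hypothetical): the actor only succeeds once it has a reflection.
def toy_act(task: str, memory: List[str]) -> str:
    return "false" if memory else "true"


def toy_evaluate(task: str, action: str) -> float:
    return 1.0 if action == "false" else 0.0


def toy_reflect(task: str, action: str, reward: float) -> str:
    return f"Answer '{action}' scored {reward}; reconsider the evidence."


final_action, reflections = reflexion_loop(
    "The Golden Gate is located in London.", toy_act, toy_evaluate, toy_reflect
)
```

In this toy run the first trial fails, one reflection is stored, and the second trial succeeds because the actor can now see that reflection.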


Assuming access to a factual source (the Wikipedia API), these are possible scenarios for using a Reflexion agent:

















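I will try to build a Fact-Checking Agent on top of a simple re-implementation of the Reflexion framework. Some parts are tricky and depend on the use case, especially the evaluation of the Actor's actions.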

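GPTBot serves as the interface to the language model: it sends each message to the model and returns the response. Because __call__ clears the history before every query, any context the model needs must be included in the message itself.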

import logging

import openai


class GPTBot:
    """Thin wrapper around the OpenAI chat completions API."""

    def __init__(self, model_name, api_key):
        self.model_name = model_name
        openai.api_key = api_key  # use the key passed to the constructor
        self.history = []  # Store the full conversation history

    def __call__(self, message):
        # Clear history for each new query to start fresh
        self.history = []
        self.history.append({"role": "user", "content": message})
        return self.execute()

    def execute(self):
        try:
            completion = openai.chat.completions.create(
                model=self.model_name,
                messages=self.history
            )
            response = completion.choices[0].message.content
            self.history.append({"role": "assistant", "content": response})
            return response
        except Exception as e:
            logging.error(f"Error during API call: {e}")
            return "Sorry, I encountered an error."

    def reset(self):
        """Reset the conversation history."""
        self.history = []

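The Actor serves as the decision-making entity within the Reflexion framework. It builds the context for each claim by including recent experiences from short-term memory and insights from long-term memory.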

def actor(claim, bot, short_term_memory, long_term_memory):
    # Build one prompt from the claim plus both memories
    message = f"Claim: {claim}\n"
    message += "Recent experiences and verdicts:\n"
    for experience in short_term_memory:
        message += f"- Claim: {experience['claim']}, Verdict: {experience['verdict']}, Info: {experience['info']}\n"
    message += "Past reflections:\n"
    for reflection in long_term_memory:
        message += f"- {reflection}\n"
    message += "Analyze the claim's validity and gather supporting or contradicting information from Wikipedia."
    response = bot(message)
    preliminary_verdict, gathered_information = interpret_response(response)
    return preliminary_verdict, gathered_information
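The prompt above asks the model to gather information from Wikipedia, but the code gives it no retrieval tool, so the model can only answer from what it already knows. One way to supply real evidence is to fetch a page summary and append it to the message. The sketch below assumes the public Wikipedia REST summary endpoint and its `extract` field; `summary_url`, `fetch_summary`, and the User-Agent string are hypothetical names introduced here, and choosing which page title to look up for a given claim is left open.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Wikipedia REST API page summary route.
WIKI_SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title: str) -> str:
    """Build the summary URL for a page title (spaces become underscores)."""
    return WIKI_SUMMARY_URL + urllib.parse.quote(title.replace(" ", "_"), safe="")


def fetch_summary(title: str, timeout: float = 10.0) -> str:
    """Return the plain-text summary of a page, or "" if the lookup fails."""
    request = urllib.request.Request(
        summary_url(title),
        headers={"User-Agent": "reflexion-fact-check-demo/0.1"},  # placeholder value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("extract", "")
    except (OSError, ValueError):
        # Network errors and malformed JSON both degrade to "no evidence".
        return ""
```

A call such as `fetch_summary("Golden Gate Bridge")` could then be added to the actor's message under a line like "Evidence from Wikipedia:" before the final instruction.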


def evaluator(claim, preliminary_verdict):
    reward = compute_reward(claim, preliminary_verdict)
    return reward


def compute_reward(claim, preliminary_verdict):
    # Ground-truth labels for the demo claims
    expected_truths = {
        "The Golden Gate is located in London.": False,
        "Edward Witten is the first physicist who received Fields Medal.": True,
        "The Great Wall of China is visible from the Moon.": False,
    }
    expected_truth = expected_truths.get(claim)
    # Normalize the verdict for case-insensitive comparison
    verdict_lower = preliminary_verdict.lower()
    # Define phrases that indicate a claim is considered true or false
    # Caveat: these are substring checks, so "not valid" also contains "valid"
    indicators_of_truth = ['true', 'valid', 'correct', 'accurate']
    indicators_of_falsehood = ['false', 'not true', 'not valid', 'debunked', 'myth', 'not accurate']
    # Check the alignment of the verdict with the expected truth
    if expected_truth is True and any(indicator in verdict_lower for indicator in indicators_of_truth):
        return 1.0
    elif expected_truth is False and any(indicator in verdict_lower for indicator in indicators_of_falsehood):
        return 1.0
    return 0.0
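The keyword check in compute_reward uses plain substring matching, so a verdict such as "is not valid" also matches the positive indicator "valid". A stricter variant, sketched below, tests negative phrases first and matches whole words only. `classify_verdict`, `compute_reward_strict`, and the two patterns are hypothetical names introduced for illustration, and the phrase lists are assumptions rather than a complete vocabulary.

```python
import re
from typing import Optional

# Negative phrases are tested first so "not valid" is never read as "valid".
NEGATIVE = re.compile(
    r"\b(false|incorrect|inaccurate|invalid|debunked|myth|"
    r"not\s+(true|valid|correct|accurate))\b"
)
POSITIVE = re.compile(r"\b(true|valid|correct|accurate)\b")


def classify_verdict(verdict: str) -> Optional[bool]:
    """Map a free-text verdict to True, False, or None when undecidable."""
    text = verdict.lower()
    if NEGATIVE.search(text):
        return False
    if POSITIVE.search(text):
        return True
    return None


def compute_reward_strict(expected_truth: Optional[bool], verdict: str) -> float:
    """Reward 1.0 only when the classified verdict equals the expected label."""
    predicted = classify_verdict(verdict)
    if expected_truth is None or predicted is None:
        return 0.0
    return 1.0 if predicted == expected_truth else 0.0
```

With this variant, a "not valid" verdict on a claim labelled True earns 0.0 instead of 1.0. Keyword matching remains brittle either way; asking the model for a fixed label is usually easier to evaluate.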


def self_reflection(claim, gathered_information, preliminary_verdict, reward):
    # Template-based reflection; gathered_information is accepted but not used here
    reflection = f"Claim: {claim}, Verdict: {preliminary_verdict}, Reward: {reward}. "
    reflection += "Consider improving information gathering and analysis for better accuracy."
    return reflection
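The self_reflection function above returns the same fixed advice for every claim. In Reflexion the reflection is meant to be produced by a language model from the trial's outcome, which could look roughly like the sketch below. `build_reflection_prompt`, `llm_self_reflection`, and `echo_bot` are hypothetical names, and the prompt wording is an assumption; any callable that maps a prompt string to a response string, such as a GPTBot instance, can be passed as `bot`.

```python
from typing import Callable


def build_reflection_prompt(claim: str, info: str, verdict: str, reward: float) -> str:
    """Describe the finished trial and ask for a short lesson."""
    outcome = "correct" if reward >= 1.0 else "incorrect"
    return (
        f"Claim: {claim}\n"
        f"Evidence gathered: {info}\n"
        f"Verdict given: {verdict}\n"
        f"The evaluator judged this verdict {outcome} (reward {reward}).\n"
        "In one or two sentences, state what to do differently next time."
    )


def llm_self_reflection(
    claim: str, info: str, verdict: str, reward: float, bot: Callable[[str], str]
) -> str:
    """Ask the model for a reflection and return it without surrounding whitespace."""
    return bot(build_reflection_prompt(claim, info, verdict, reward)).strip()


# Stand-in bot for a dry run without an API key.
def echo_bot(prompt: str) -> str:
    return " Check the award year against a primary source. "


reflection = llm_self_reflection("c", "i", "v", 0.0, echo_bot)
```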


# Assumed value: the original code does not define this constant
MAX_SHORT_TERM_MEMORY_SIZE = 5


def update_short_term_memory(short_term_memory, claim, gathered_information, preliminary_verdict, reward):
    short_term_memory.append({
        'claim': claim,
        'info': gathered_information,
        'verdict': preliminary_verdict,
        'reward': reward
    })
    # Drop the oldest experience once the memory is full
    if len(short_term_memory) > MAX_SHORT_TERM_MEMORY_SIZE:
        short_term_memory.pop(0)
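The append-then-pop pattern above is a bounded first-in, first-out buffer, which the standard library already provides as `collections.deque` with `maxlen`. A minimal sketch follows; the size of 3 and the `remember` helper are illustrative choices, not values from the original code.

```python
from collections import deque

MAX_SHORT_TERM_MEMORY_SIZE = 3  # illustrative value

# A deque with maxlen discards the oldest entry automatically when full.
short_term_memory = deque(maxlen=MAX_SHORT_TERM_MEMORY_SIZE)


def remember(memory, claim, info, verdict, reward):
    """Store one experience; eviction is handled by the deque itself."""
    memory.append({"claim": claim, "info": info, "verdict": verdict, "reward": reward})


for i in range(5):
    remember(short_term_memory, f"claim {i}", "", "unknown", 0.0)

kept = [entry["claim"] for entry in short_term_memory]
```

After five insertions only the three most recent claims remain, and the actor can iterate over the deque exactly as it does over a list.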


def update_long_term_memory(long_term_memory, reflection):
    # Keep only reflections that pass the keyword filter
    if is_valuable(reflection):
        long_term_memory.append(reflection)

Parse the bot's response to separate preliminary_verdict from gathered_information.

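def interpret_response(response):
    # The first sentence is treated as the verdict, the second as the information
    split_response = response.split(". ")
    preliminary_verdict = split_response[0].replace("Verdict: ", "")
    # Guard against single-sentence responses, which have no second element
    if len(split_response) > 1:
        gathered_information = split_response[1].replace("Information: ", "")
    else:
        gathered_information = ""
    return preliminary_verdict, gathered_information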
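Splitting on ". " depends on the model putting the verdict in its first sentence. If the prompt also instructed the model to answer with "Verdict:" and "Information:" labels (an assumption; the actor prompt in this article does not ask for them), the response could be parsed by label instead. `parse_labelled_response` is a hypothetical helper that falls back to the first-sentence heuristic when the labels are missing.

```python
import re
from typing import Tuple

# Matches "Verdict: ... Information: ..." across line breaks, in any letter case.
LABELLED = re.compile(
    r"Verdict:\s*(?P<verdict>.*?)\s*Information:\s*(?P<info>.*)",
    re.DOTALL | re.IGNORECASE,
)


def parse_labelled_response(response: str) -> Tuple[str, str]:
    """Return (verdict, information) from a model response."""
    match = LABELLED.search(response)
    if match:
        return match.group("verdict").strip(), match.group("info").strip()
    # Fallback: first sentence is the verdict, the remainder is the information.
    verdict, _, info = response.strip().partition(". ")
    return verdict, info
```

Unlike the split-based version, the fallback keeps everything after the first sentence as information rather than only the second sentence.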


def is_valuable(reflection):
    # Define keywords that indicate valuable insights
    valuable_keywords = ['improved', 'new insight', 'significant', 'actionable step', 'corrected']
    # Check if the reflection contains any of the valuable keywords
    reflection_lower = reflection.lower()
    for keyword in valuable_keywords:
        if keyword in reflection_lower:
            return True
    # If none of the valuable keywords are found, consider the reflection less valuable
    return False


def get_claims():
    # Demo claims; each has a ground-truth label in compute_reward
    claims = [
        "The Golden Gate is located in London.",
        "Edward Witten is the first physicist who received Fields Medal.",
        "The Great Wall of China is visible from the Moon."
    ]
    return claims

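The main loop orchestrates the whole process. For each claim, it resets the bot, generates preliminary_verdict and gathered_information through the Actor, evaluates the verdict to compute the reward, updates the memories, and generates a reflection. Finally, it prints the result for each claim.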

def main():
    claims = get_claims()
    short_term_memory = []
    long_term_memory = []
    bot = GPTBot(model_name="gpt-3.5-turbo-1106", api_key="your-openai-api-key")
    for claim in claims:
        bot.reset()
        # Act, evaluate, then store the experience and the reflection
        preliminary_verdict, gathered_information = actor(claim, bot, short_term_memory, long_term_memory)
        reward = evaluator(claim, preliminary_verdict)
        update_short_term_memory(short_term_memory, claim, gathered_information, preliminary_verdict, reward)
        reflection = self_reflection(claim, gathered_information, preliminary_verdict, reward)
        update_long_term_memory(long_term_memory, reflection)
        print(f"Claim: {claim}, Verdict: {preliminary_verdict}, Reward: {reward}, Reflection: {reflection}")


if __name__ == "__main__":
    main()


Claim 1: The Golden Gate is located in London., Verdict: The claim that the Golden Gate is located in London is not valid, Reward: 1.0.
Claim 2: Edward Witten is the first physicist who received Fields Medal., Verdict: The claim that Edward Witten is the first physicist who received the Fields Medal is not valid, Reward: 1.0.
Claim 3: The Great Wall of China is visible from the Moon., Verdict: The claim that the Great Wall of China is visible from the Moon has been widely debunked, Reward: 1.0.

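To my knowledge, claims 1 and 3 were resolved correctly. For claim 2, the verdict is wrong: Professor Edward Witten is the first physicist to receive the Fields Medal. Note that the evaluator still awarded 1.0 for claim 2, because its substring check for "valid" also matches "not valid". Here is ChatGPT-4's answer to my question: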

Question: Edward Witten is the first physicist who received Fields Medal. True or false?

ChatGPT: False. Edward Witten is not the first physicist to receive the Fields Medal. He did receive the Fields Medal, which is often regarded as the highest honor a mathematician can receive, but he was not the first. The Fields Medal has been awarded to many mathematicians since it was first established in 1936. Edward Witten received it in 1990 for his contributions to mathematical physics. The distinction of being the first physicist to receive the Fields Medal likely adds to the common confusion, as Witten's work beautifully bridges the gap between mathematics and physics.


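The Reflexion framework is innovative and promising, especially in how it combines natural language processing with reinforcement learning, but it has certain limitations and potential weaknesses: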

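Dependence on language model quality: The effectiveness of the Reflexion framework relies heavily on the underlying language model. Any limitation of that model in understanding, reasoning, or generating coherent and contextually accurate responses directly affects the performance of the Reflexion agent.

Difficulty of evaluating reflections: Assessing the quality and usefulness of the reflections produced by the self-reflection component can be challenging. It requires a mechanism that can judge whether a reflection offers a meaningful insight or an actionable suggestion for improvement.

Handling ambiguity and nuance: Natural language is inherently ambiguous and context-dependent. The Reflexion framework must be able to understand and handle these nuances, especially when interpreting claims, generating verdicts, and producing reflections.

Generalization across domains: While the Reflexion framework may be strong in a specific domain, generalizing its capabilities across a wide range of tasks or subject areas can be difficult. Tuning the system to perform well in one domain does not necessarily translate into success in another.

Feedback loop stability: The system's learning is driven by the feedback loop between the Actor, Evaluator, and Self-Reflection components. It is essential that this loop is stable and leads to meaningful learning rather than reinforcing incorrect behavior or bias.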

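Addressing these weaknesses requires careful design, rigorous testing, and continuous refinement of the system. Advanced techniques in natural language processing, reinforcement learning, and domain adaptation, together with effective data management and model evaluation strategies, are essential to realizing the full potential of the Reflexion framework.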


Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao

