A PDF Document Analysis Tool That Combines RAG with LLaMA-3.1-8B

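01.

Overview

A personal assistant tool built on retrieval-augmented generation (RAG) and the LLaMA-3.1-8B-Instant large language model (LLM). The tool aims to transform PDF document analysis by combining machine learning with retrieval-based systems.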

02.

Origins of the RAG Architecture

Retrieval-augmented generation (RAG) is a technique in natural language processing (NLP) that combines retrieval-based methods with generative models to produce more accurate, contextually relevant output. It was introduced by Facebook AI Research (FAIR) in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".

For a deeper look at RAG, see the original paper:

https://arxiv.org/pdf/2005.11401

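03.

RAG Architecture Overview

A RAG model consists of three main components:

Indexer: builds an index of the corpus so that relevant documents can be retrieved efficiently.
Retriever: given an input query, retrieves the relevant documents from the indexed corpus.
Generator: produces a response based on the retrieved documents.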
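The three components can be illustrated with a deliberately small sketch. It is not the tool's implementation: it assumes a bag-of-words index and cosine similarity in place of learned embeddings, and the `generate` function only assembles a prompt where a real system would call an LLM. All function names and the sample corpus are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Indexer: pair each document with its bag-of-words term counts."""
    return [(doc, Counter(tokenize(doc))) for doc in corpus]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index, query, top_k=2):
    """Retriever: return the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def generate(query, documents):
    """Generator stand-in: a real system would send this prompt to an LLM."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Question: {query}\nContext:\n{context}"


# Hypothetical three-document corpus.
corpus = [
    "Tesseract is an OCR engine that extracts text from scanned pages.",
    "Gradio builds simple web interfaces for machine learning demos.",
    "RAG combines a retriever with a generator to ground answers in documents.",
]
index = build_index(corpus)
query = "How does RAG ground answers?"
top_docs = retrieve(index, query, top_k=1)
print(generate(query, top_docs))
```

Swapping the term-count vectors for dense embeddings changes only `build_index` and `cosine`; the split into indexing, retrieval, and generation stays the same.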

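04.

Implementation Details

Training a RAG model proceeds in three stages:

Indexer training: the indexer is trained to create an efficient, accurate mapping between queries and documents.
Retriever training: the retriever is trained to maximize the relevance scores of relevant documents.
Generator training: the generator is trained to maximize the probability of the ground-truth response.

At inference time, a RAG model follows these steps:

Indexing: the corpus is indexed for efficient retrieval.
Retrieval: the top-ranked documents are retrieved according to their relevance scores for the given query.
Generation: a response is generated from the input query and the retrieved documents.

The final response is obtained by marginalizing over the retrieved documents, that is, by weighting the output generated from each document by that document's retrieval probability and summing the results.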
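In the original paper's RAG-Sequence formulation, marginalization approximates the answer probability as p(y|x) ≈ Σ_z p(z|x) · p(y|x,z), where z ranges over the retrieved documents. The sketch below works through that sum with invented numbers. The retrieval scores, candidate answers, and per-document probabilities are all hypothetical, and a softmax over the scores is assumed as the way to obtain p(z|x).

```python
import math


def softmax(scores):
    """Turn raw relevance scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(retrieval_scores, answer_probs_per_doc):
    """Weight each document's answer probabilities by p(doc | query) and sum."""
    doc_probs = softmax(retrieval_scores)
    answers = answer_probs_per_doc[0].keys()
    return {
        answer: sum(
            p_doc * probs[answer]
            for p_doc, probs in zip(doc_probs, answer_probs_per_doc)
        )
        for answer in answers
    }


# Hypothetical numbers: two retrieved documents, two candidate answers.
retrieval_scores = [2.0, 1.0]
answer_probs_per_doc = [
    {"Paris": 0.9, "Lyon": 0.1},  # p(answer | query, document 1)
    {"Paris": 0.4, "Lyon": 0.6},  # p(answer | query, document 2)
]
final = marginalize(retrieval_scores, answer_probs_per_doc)
best_answer = max(final, key=final.get)
print(best_answer, final)
```

Because the higher-scoring document receives more weight, its preferred answer dominates the combined distribution even though the second document disagrees.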

05.

Installation

Install Packages

    !conda install -n pa \
        pytorch \
        torchvision \
        torchaudio \
        cpuonly \
        -c pytorch \
        -c conda-forge \
        --yes
    %pip install -U ipywidgets
    %pip install -U requests
    %pip install -U llama-index
    %pip install -U llama-index-embeddings-huggingface
    %pip install -U llama-index-llms-groq
    %pip install -U groq
    %pip install -U gradio
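After the install step, it can help to confirm that the packages actually landed in the active environment. The following sketch uses the standard library's `importlib.metadata` for that check. The `REQUIRED` list simply mirrors the pip commands above, and the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = [
    "ipywidgets",
    "requests",
    "llama-index",
    "llama-index-embeddings-huggingface",
    "llama-index-llms-groq",
    "groq",
    "gradio",
]


def installed_versions(distributions):
    """Return {distribution: version string, or None if it is not installed}."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'MISSING'}")
```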

Install Tesseract

    import os
    import platform
    import subprocess
    import requests


    def install_tesseract():
        """Installs Tesseract OCR based on the operating system."""
        os_name = platform.system()
        if os_name == "Linux":
            print("Detected Linux. Installing Tesseract using apt-get...")
            subprocess.run(["sudo", "apt-get", "update"], check=True)
            subprocess.run(["sudo", "apt-get", "install", "-y", "tesseract-ocr"], check=True)
        elif os_name == "Darwin":
            print("Detected macOS. Installing Tesseract using Homebrew...")
            subprocess.run(["brew", "install", "tesseract"], check=True)
        elif os_name == "Windows":
            tesseract_installer_url = "https://github.com/UB-Mannheim/tesseract/releases/download/v5.4.0.20240606/tesseract-ocr-w64-setup-5.4.0.20240606.exe"
            installer_path = "tesseract-ocr-w64-setup-5.4.0.20240606.exe"
            # Download the installer. It is only saved to disk here and
            # still has to be run manually before Tesseract is available.
            response = requests.get(tesseract_installer_url)
            response.raise_for_status()
            with open(installer_path, "wb") as file:
                file.write(response.content)
            # Add the default install location to PATH for this process.
            tesseract_path = r"C:\Program Files\Tesseract-OCR"
            os.environ["PATH"] += os.pathsep + tesseract_path
            try:
                result = subprocess.run(["tesseract", "--version"], check=True, capture_output=True, text=True)
                print(result.stdout)
            except (subprocess.CalledProcessError, FileNotFoundError) as e:
                # FileNotFoundError is raised when the installer has not been run yet.
                print(f"Error running Tesseract: {e}")
        else:
            print(f"Unsupported OS: {os_name}")


    install_tesseract()
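Whichever platform branch runs, a quick way to confirm the result is to look for the `tesseract` executable on `PATH`. This is a minimal sketch using only the standard library. The helper name `tesseract_version` is hypothetical and not part of the original tool.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract not found on PATH")
```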

Convert PDF to OCR

    import webbrowser

    url = "https://www.ilovepdf.com/ocr-pdf"
    webbrowser.open_new(url)

Import Libraries

    import os
    from llama_index.core import (
        Settings,
        VectorStoreIndex,
        SimpleDirectoryReader,
        StorageContext,
        load_index_from_storage
    )
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.llms.groq import Groq
    import gradio as gr
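The article stops at the imports, so the following is only a guess at how they might be wired together, not the tool's actual code. It assumes the PDFs live in a `pdfs` directory, the index is persisted to `storage`, the Groq API key is in a `GROQ_API_KEY` environment variable, and the embedding model is `BAAI/bge-small-en-v1.5`. The model ID, chunk sizes, directory names, and both function names are illustrative choices rather than details confirmed by the source.

```python
import os


def has_persisted_index(persist_dir):
    """True if persist_dir exists and contains at least one entry."""
    return os.path.isdir(persist_dir) and bool(os.listdir(persist_dir))


def build_query_engine(pdf_dir="pdfs", persist_dir="storage"):
    """Load or build a vector index over pdf_dir and return a query engine."""
    # Imported here so the helper above stays usable without these packages.
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.groq import Groq

    # Assumed settings: model names and chunk sizes are illustrative.
    Settings.llm = Groq(model="llama-3.1-8b-instant", api_key=os.environ["GROQ_API_KEY"])
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    Settings.node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

    if has_persisted_index(persist_dir):
        # Reuse the index saved by an earlier run.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    else:
        # Read the PDFs, embed them, and save the index for next time.
        documents = SimpleDirectoryReader(pdf_dir).load_data()
        index = VectorStoreIndex.from_documents(documents)
        index.storage_context.persist(persist_dir=persist_dir)
    return index.as_query_engine()
```

With the packages installed and the key set, `build_query_engine().query("Summarize the document")` would return a response object that can be printed. A Gradio front end could then wrap that call in a function passed to `gr.Interface`.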

Reference:

1. https://github.com/mytechnotalent/pa?tab=readme-ov-file


智能科技扫地僧