使用LlamaIndex載入多種類文件

LlamaIndex的教學資料

這邊有許多的簡單範例:

https://github.com/SamurAIGPT/LlamaIndex-course

這邊則是載入文件的範例:

https://github.com/SamurAIGPT/LlamaIndex-course/blob/main/dataconnectors/Data_Connectors.ipynb

範例程式載入PDF如下

from pathlib import Path
from llama_index.core import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()

pdf_document = loader.load_data(file=Path('./sample.pdf'))

載入YouTube的字幕的範例如下

from llama_index.core import download_loader

YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")

loader = YoutubeTranscriptReader()
youtube_documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=nHcbHdgVUJg&ab_channel=WintWealth'])

使用上面的寫法我們會發現會跳出這樣的警告

DeprecationWarning: Call to deprecated function (or staticmethod) download_loader. (`download_loader()` is deprecated. Please install tool using pip install directly instead.) PDFReader = download_loader(“PDFReader”)
錯誤警告訊息

新的文件讀取方式

現在官方推薦的檔案讀取方式如下

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./files").load_data()

不過，如果我們需要這個Reader使用特別的解析器去解析特別格式的文件的話，則要使用額外相關的函式庫，如下面的介紹

讀取各式文件的函式庫

首先先安裝所需的套件

pip install llama-index-readers-file

那Reader頁面中許多各式各樣的Reader則請參考此文件:

https://llamahub.ai/l/readers/llama-index-readers-file?from=readers

使用範例如下，下面這子我們會可以用doc.text去取得這張圖片的一個文字描述

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import ImageCaptionReader
# Image Reader example
parser = ImageCaptionReader()
file_extractor = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
}  # Add other image formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
for index, doc in enumerate(documents):
    print(doc.text)

要注意的是，當我們這邊使用圖片閱讀器，事實上他會載入Hugging Face的一些transformers模型去做圖片辨識，有一些模型只能使用GPU，所以，我們一定要記得我們的Pytorch要使用GPU版本的

參考此文件: https://pytorch.org/get-started/locally/

如果已經安裝了CPU版本的，記得先把torch反安裝後再重新安裝GPU版本