import os

from llama_index import LLMPredictor, VectorStoreIndex
from langchain import OpenAI

# Set the OpenAI API key (replace "api-key" with your own key)
os.environ["OPENAI_API_KEY"] = "api-key"

# Build a vector index over the parsed nodes
index = VectorStoreIndex(nodes)
Building the Retriever
We will use VectorIndexRetriever, which retrieves the top-k documents most similar to the query. In this example, we set k to 2.
from llama_index.retrievers import VectorIndexRetriever

# Retrieve the top-2 most similar nodes for each query
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)
Building the Query Engine
Now we can build a query engine on top of the retriever and start running queries.
from llama_index.query_engine import RetrieverQueryEngine

# Wrap the retriever in a query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
)
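Optionally, weak matches can be filtered out before the response is synthesized. Below is a minimal sketch, assuming a llama_index release that exposes SimilarityPostprocessor under llama_index.postprocessor (the module path has moved between versions); the 0.7 cutoff is an illustrative value, not a recommendation:

from llama_index.postprocessor import SimilarityPostprocessor
from llama_index.query_engine import RetrieverQueryEngine

# Drop retrieved nodes whose similarity score falls below the cutoff
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)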
Asking Questions
With the query engine in place, we can now ask a question about the indexed documents.
response = query_engine.query("What did the author do growing up?")
print(response)
The log line INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" indicates that an HTTP POST request was sent to https://api.openai.com/v1/embeddings and a 200 OK response was received.
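Whether these log lines appear depends on the logging level. A minimal sketch, using Python's standard logging module, for turning INFO-level output on:

import logging
import sys

# Route INFO-level logs (including the httpx request lines) to stdout
logging.basicConfig(stream=sys.stdout, level=logging.INFO)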
# Load a previously persisted index
from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)

# Ask a question against the loaded index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
The response then looks like this:
The author worked on writing short stories and programming, starting with the IBM 1401 in 9th grade, using an early version of Fortran. Later, the author transitioned to microcomputers, particularly a TRS-80, where they wrote simple games, a rocket prediction program, and a word processor.
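For the loading code above to work, the index must first have been written to disk. A minimal sketch, assuming PERSIST_DIR points at a directory such as "./storage" (match whatever path you later load from):

PERSIST_DIR = "./storage"  # assumed path, for illustration only

# Write the index's docstore, vector store, and index store to disk
index.storage_context.persist(persist_dir=PERSIST_DIR)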
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load text data
text_data = ["This is a sample text.", "Another text sample.", ...]

# Preprocess text data (e.g., strip whitespace, convert to lowercase)
preprocessed_text = [t.strip().lower() for t in text_data]

# Vectorize the text using TF-IDF
# (TfidfVectorizer tokenizes internally, so it expects raw strings,
# not pre-tokenized lists)
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(preprocessed_text)

# Assign labels to the vectorized data (e.g., sentiment analysis, classification)
y = ...

# Create a Pandas DataFrame with the preprocessed data and labels
df = pd.DataFrame({'text': preprocessed_text, 'label': y})

# Save the dataset to a file or database for future use
df.to_csv('nlp_dataset.csv', index=False)
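To sanity-check the vectorized output, it helps to look at the matrix shape and the learned vocabulary. A short illustrative snippet reusing the names above; get_feature_names_out assumes scikit-learn 1.0 or newer:

# X is a sparse matrix of shape (n_documents, n_features)
print(X.shape)

# The first ten terms of the learned vocabulary
print(vectorizer.get_feature_names_out()[:10])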
Tools and Resources for Curating NLP Datasets
NLTK: NLTK (Natural Language Toolkit) is a widely used Python library for natural language processing. It includes many tools for data cleaning, preprocessing, and annotation. NLTK documentation: https://www.nltk.org/book/
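As a brief illustration of the preprocessing NLTK supports, the sketch below tokenizes a sentence and removes English stopwords. It assumes the punkt tokenizer model and the stopword corpus have been downloaded (newer NLTK releases may require 'punkt_tab' instead of 'punkt'):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and stopword list
nltk.download('punkt')
nltk.download('stopwords')

text = "This is a sample text for preprocessing."

# Tokenize, lowercase, and drop punctuation and English stopwords
tokens = word_tokenize(text.lower())
filtered = [t for t in tokens if t.isalpha() and t not in stopwords.words('english')]
print(filtered)  # ['sample', 'text', 'preprocessing']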