Tesseract: Google's Open-Source Optical Character Recognition System

About Tesseract

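Tesseract is an open-source optical character recognition (OCR) engine that converts the text in images into editable text. It was originally developed at Hewlett-Packard, and Google developed it from 2006 to 2018. It supports many languages and character sets.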

GitHub repository: https://github.com/tesseract-ocr/tesseract

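Tesseract 4 added a new OCR engine based on LSTM neural networks that focuses on line recognition. It still supports the legacy Tesseract 3 engine, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled with the legacy OCR engine mode (--oem 0). That mode also requires traineddata files that support the legacy engine, for example those from the tessdata repository.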
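The engine mode described above can also be selected from Python. The following is a minimal sketch, assuming pytesseract and Pillow are installed; the helper names engine_config and ocr_with_engine are made up for this illustration, and the numeric modes are the ones printed by tesseract --help-oem.

```python
# Engine modes as listed by `tesseract --help-oem`.
OEM_LEGACY = 0             # legacy (Tesseract 3 style) engine only
OEM_LSTM = 1               # neural-network LSTM engine only
OEM_LEGACY_AND_LSTM = 2    # both engines combined
OEM_DEFAULT = 3            # whatever the installed traineddata supports


def engine_config(oem):
    """Return the command-line fragment that selects an OCR engine mode."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_LEGACY_AND_LSTM, OEM_DEFAULT):
        raise ValueError(f"unknown OCR engine mode: {oem}")
    return f"--oem {oem}"


def ocr_with_engine(image_path, lang="eng", oem=OEM_LSTM):
    """Hypothetical helper: run OCR on one image with an explicit engine mode."""
    # Imported lazily so engine_config() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=engine_config(oem)
        )
```

Calling ocr_with_engine('page.tif', oem=OEM_LEGACY) only works if the installed traineddata file for that language contains the legacy model; otherwise Tesseract reports an error.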

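Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages "out of the box". It reads several image formats, including PNG, JPEG, and TIFF, and writes several output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, ALTO, and PAGE.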
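On the command line, output formats are chosen by listing config names after the output base name. The sketch below builds such a command from Python using only the standard library. The function names build_command and run_ocr are hypothetical, the set of config names is limited to ones I am confident exist (PAGE is left out because it needs a recent release), and the invisible-text PDF is requested through the textonly_pdf variable.

```python
import subprocess

# Output config names that can follow the output base on the command line.
OUTPUT_CONFIGS = {"txt", "hocr", "pdf", "tsv", "alto"}


def build_command(image_path, output_base, lang="eng",
                  formats=("txt",), text_only_pdf=False):
    """Build the argv list for one tesseract run producing several outputs."""
    unknown = set(formats) - OUTPUT_CONFIGS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if text_only_pdf:
        # Produce a PDF that contains only the invisible text layer.
        cmd += ["-c", "textonly_pdf=1"]
    cmd += list(formats)
    return cmd


def run_ocr(image_path, output_base, **options):
    """Run tesseract and raise CalledProcessError if it fails."""
    subprocess.run(build_command(image_path, output_base, **options), check=True)
```

For example, run_ocr('page.tif', 'page', formats=('txt', 'pdf')) would write page.txt and page.pdf next to each other, because Tesseract appends the extension to the output base itself.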

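Main Features

Multilingual support: Tesseract supports more than 100 languages, including Traditional Chinese.
High accuracy: Tesseract recognizes text accurately, especially when the image has been preprocessed appropriately.
Easy integration: Tesseract can be integrated with many programming languages and tools, such as Python, C++, and Java, so developers can use it in a wide range of applications.
Open source and free: Tesseract is open-source software that can be freely used and modified.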
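The feature list mentions preprocessing without saying what it involves. One common recipe, shown here as a sketch rather than a recommendation from the Tesseract project, is to convert the image to grayscale, stretch its contrast, and binarize it with a fixed threshold. The function name preprocess_for_ocr and the default threshold of 128 are assumptions chosen for illustration; real scans often need a different value.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, threshold=128):
    """Return a black-and-white copy of the image that is easier to OCR."""
    # 1. Drop colour information; OCR only needs luminance.
    gray = image.convert("L")
    # 2. Stretch the histogram so the darkest pixel becomes 0 and the
    #    brightest becomes (close to) 255.
    stretched = ImageOps.autocontrast(gray)
    # 3. Binarize: everything brighter than the threshold turns white,
    #    everything else turns black.
    return stretched.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to pytesseract.image_to_string in place of the original image.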

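Installing Tesseract

Two parts need to be installed: the engine itself and the traineddata for the languages you need. Packages for more than 130 languages and more than 35 scripts are also available directly from Linux distributions. The language traineddata packages are named "tesseract-ocr-langcode" and "tesseract-ocr-script-scriptcode", where langcode is a three-letter language code and scriptcode is a four-letter script code. The repository used below follows a different naming scheme, for example tesseract-langpack-deu for German.

Installation guide: https://tesseract-ocr.github.io/tessdoc/InstallationOpenSuse.html

Run the following commands as root (CentOS 7):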

yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract 
yum install tesseract-langpack-deu
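After installation it is worth checking which language data Tesseract can actually see, since a missing traineddata file is a common cause of errors. The sketch below runs tesseract --list-langs and parses its output. The function names are hypothetical, and the parser assumes the usual layout of a header line followed by one language name per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language names from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        name = line.strip()
        # Skip blank lines and the header ("List of available languages ...");
        # language names such as "eng" or "chi_tra" never contain spaces.
        if name and " " not in name:
            langs.append(name)
    return sorted(langs)


def installed_languages(tesseract_cmd="tesseract"):
    """Ask the installed tesseract binary which languages it can load."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: some old releases print to stderr
        text=True,
        check=True,
    )
    return parse_list_langs(result.stdout)
```

With only the German pack from the commands above installed, 'deu' should appear in the result of installed_languages(), while 'chi_tra' appears only after the corresponding language data has been installed.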

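Usage

Download the AppImage from the releases page
Open a terminal application
Navigate to the location of the AppImage
Make the AppImage executable: $ chmod a+x tesseract*.AppImage
Run it: $ ./tesseract*.AppImage -l eng page.tif page (Tesseract appends the extension itself, so the recognized text is written to page.txt)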

Installing pytesseract

pytesseract is a Python wrapper that calls the Tesseract OCR engine.

pip install pytesseract
pip install pillow

Performing OCR with Tesseract

from PIL import Image
import pytesseract

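# Set the path to the tesseract executable (only needed if it is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'/usr/local/bin/tesseract'  # replace with your Tesseract installation path

# Open the image file
image = Image.open('example.png')

# Run OCR with Tesseract
text = pytesseract.image_to_string(image, lang='chi_tra')  # use the Traditional Chinese language data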
print(text)
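
image_to_string returns only the text. When per-word confidence matters, pytesseract.image_to_data exposes the same result as a table. The sketch below, with hypothetical helper names and an arbitrary cutoff of 60, keeps only words that Tesseract is reasonably sure about. It assumes the dictionary layout that pytesseract returns with Output.DICT, where the 'text' and 'conf' lists run in parallel and non-word rows carry a confidence of -1.

```python
def confident_words(data, min_conf=60.0):
    """Keep the recognized words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Rows for pages, blocks and lines have empty text and conf == -1.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_words(image_path, lang="chi_tra", min_conf=60.0):
    """Hypothetical helper: OCR one image and drop low-confidence words."""
    # Imported lazily so confident_words() can be used on saved results too.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
    return confident_words(data, min_conf)
```

A suitable cutoff depends on the material, so it is worth printing the raw confidences for a few sample images before settling on a value.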