Here the author compares several LLM AI models. The goal is to find the LLM best able to understand and reply in Traditional Chinese. A further requirement is that the LLM should serve physically and mentally handicapped or otherwise disadvantaged groups and help them understand how to apply for government welfare resources in Taiwan. This matters because government websites are often written in legalese. The categories under analysis are speed of reply, reasoning ability, and clarity in Traditional Chinese. Some models are run on device and others in the cloud. The cloud models are gpt4o-mini (OpenAI) and Gemini 1.5 Flash (Google). The local models, Mistral and Llama3.2, are run with Ollama on a laptop with an NVIDIA GeForce RTX 3070 Ti Laptop GPU.
Because some models run locally and others in the cloud, the raw response times are not directly comparable and were therefore not used as a comparison criterion. All tests were run in the Dify interface, on the same device that runs the models with Ollama. To reduce network latency, the laptop was connected to wired internet, which averaged 600.44 Mbps download and 51.26 Mbps upload on the browser version of Speedtest.
Research Methods
The testing process was as follows. A basic Dify workflow was created with a start node leading to one of the models.
The LLM was prompted like so:
You are a chatbot designed to help serve some physically and mentally handicapped or disadvantaged groups and help them understand how to apply for government welfare resources in Taiwan. Answer like a person from Taiwan and in traditional Chinese. Remember the person you are speaking with is most likely from Taiwan so respond accordingly.
here is the question from user: {question from user}
The first question prompted by the user was “我該如何申請低收入戶補助?” (“How do I apply for the low-income household subsidy?”). The follow-up question was “告訴我臺灣臺北市的低收入資格” (“Tell me the low-income eligibility criteria for Taipei City, Taiwan”). Following that: “根據你上面提供的資料. 我在臺北市住. 在臺北市有租一個房間. 月收13,455元 我可以申請嗎?” (“Based on the information you provided above, I live in Taipei City, rent a room there, and earn NT$13,455 per month. Can I apply?”). It was believed that these questions would test whether the LLM replies entirely in Traditional Chinese, give a relatively good idea of the LLM's speed, and test whether the LLM reasons well. When calculating the time, only the time the LLM needed to produce a response was counted, as shown in the image below (circled in red). The prompts were designed to simulate a generally authentic, real-life usage case rather than a clinical scientific study.
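The prompt assembly and per-response timing described above can be sketched in Python. This is a minimal sketch, not the exact Dify internals: the `build_prompt` and `timed_call` helpers are illustrative assumptions, though the system prompt and questions are taken verbatim from the Methods section.

```python
import time

# System prompt used for every model (verbatim from the Methods section).
SYSTEM_PROMPT = (
    "You are a chatbot designed to help serve some physically and mentally "
    "handicapped or disadvantaged groups and help them understand how to apply "
    "for government welfare resources in Taiwan. Answer like a person from "
    "Taiwan and in traditional Chinese. Remember the person you are speaking "
    "with is most likely from Taiwan so respond accordingly."
)

# The three user questions, asked in order within one conversation.
QUESTIONS = [
    "我該如何申請低收入戶補助?",
    "告訴我臺灣臺北市的低收入資格",
    "根據你上面提供的資料. 我在臺北市住. 在臺北市有租一個房間. 月收13,455元 我可以申請嗎?",
]

def build_prompt(question: str) -> str:
    # Mirrors the template above: system instructions, then the user question.
    return f"{SYSTEM_PROMPT}\nhere is the question from user: {question}"

def timed_call(model_fn, question: str):
    # Only the model's generation time is measured, matching the methodology.
    start = time.perf_counter()
    answer = model_fn(build_prompt(question))
    return answer, time.perf_counter() - start
```

In the actual tests, `model_fn` would be a call to the Dify/Ollama endpoint for the model under test; here it is left abstract.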
The factual accuracy of the responses was not judged, both because of possible hallucinations and because the actual correct information would be supplied to the system in a real-world use case. However, the consistency of each LLM's answers with the information it had itself produced was judged, as a test of reasoning ability. In other words, each response was treated as fact within its testing scenario, and these “facts” were used to judge the model's subsequent responses.
Results
Llama3.2 (Meta)
Llama3.2 took 9.883 s, 2.919 s, and 2.419 s respectively to answer the three questions. All responses are in Traditional Chinese, but a few glitches were observed, for instance “住住的人” in the responses to questions 2 and 3. Otherwise the answers seemed fine, with sound logic and reasoning, and Llama3.2 sounded quite professional. This model appears to suit the goal quite well. See the full responses from Llama3.2 in the Full Response from LLMs section below.
Mistral (Mistral AI)
Mistral took 12.312 s, 29.308 s, and 16.970 s respectively to answer the three questions. There is some use of Simplified Chinese, for instance “身份” and “证明” in the first and second responses. The language is otherwise accurate and the logic quite clear. One other thing of note: Mistral always starts its responses with “您好!”, which can make the conversation feel robotic because there is no variation.
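Checks like the one above can be partly automated. The sketch below flags Simplified-only characters in a response using a tiny hand-picked set (including 证, seen in Mistral's replies); the set is purely illustrative, and a production check would use a full mapping such as the OpenCC project's conversion tables.

```python
# A few Simplified-only characters; this tiny set is only illustrative.
# Their Traditional counterparts are 證, 為, 這, 發, 們, 體.
SIMPLIFIED_ONLY = set("证为这发们体")

def flag_simplified(text: str) -> set:
    """Return any Simplified-only characters found in a response."""
    return SIMPLIFIED_ONLY & set(text)
```

For example, `flag_simplified("身份证明")` flags 证, while a fully Traditional string such as “身分證明” passes clean.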
Gemini 1.5 Flash (Google)
Google Gemini 1.5 Flash took 8.516 s, 11.995 s, and 9.561 s respectively to answer the three questions. All responses are in Traditional Chinese, and the language is clear and accurate. The logic and reasoning are sound, and Gemini follows the conversation quite well. Interestingly, Gemini's tone is very friendly, which could help users feel more comfortable. Perhaps because of this, most of its answers follow a formulaic format: Gemini mostly starts with a greeting and ends with a sentence of encouragement. As a non-native Chinese speaker, the writer is unsure how this would come across to people in Taiwan. All in all, Gemini is quite a good fit for the goals.
gpt4o-mini (OpenAI)
gpt4o-mini took 3.594 s, 2.580 s, and 2.488 s respectively to answer the three questions. Some characters deviate from the standard Traditional forms; for instance, in the first response gpt4o-mini writes “台灣” instead of “臺灣”, though this variant is widely accepted in Taiwan. Otherwise the language is clear and accurate, and the logic and reasoning are sound. The writer observed that gpt4o-mini is careful not to give specific answers until it has the correct information, which makes it a good model for the research goals.
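Although raw timings are not compared formally (local and cloud conditions differ), the per-model means from the Results above can be summarized as a rough reference:

```python
from statistics import mean

# Measured per-question response times in seconds, from the Results section.
TIMES = {
    "llama3.2":         [9.883, 2.919, 2.419],
    "mistral":          [12.312, 29.308, 16.970],
    "gemini-1.5-flash": [8.516, 11.995, 9.561],
    "gpt-4o-mini":      [3.594, 2.580, 2.488],
}

averages = {model: round(mean(ts), 3) for model, ts in TIMES.items()}
# gpt-4o-mini averages ~2.887 s; Mistral is the slowest at ~19.53 s.
```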
Discussion
From the results, only Gemini 1.5 Flash, Llama3.2, and gpt4o-mini gave all their responses in Traditional Chinese. All three had acceptable speeds that would not make the user impatient. Compared to Llama3.2, Gemini 1.5 Flash had stronger Chinese and responded without mistakes. Gemini 1.5 Flash was also comforting and encouraging in tone, which might make it ideal for communicating with physically and mentally handicapped or disadvantaged groups and helping them understand how to apply for government welfare resources.
Conclusion
After testing all the LLMs against the goals, Gemini 1.5 Flash and gpt4o-mini seem to be the best choices at the moment. To further narrow down the best LLM for the task, tests could be run on bigger models such as Llama3 with 70B parameters (the Llama3.2 tested here has 3B). To reduce the cost and network latency of using Google's Gemini models, testing could also be done on Gemma or Gemma2; both are available on Ollama and are from Google.
Function calling is a technique that lets an LLM autonomously select and invoke predefined functions based on the conversation. These functions can perform all kinds of tasks, such as querying real-time data, running calculations, or generating images. Function calling is an essential capability for building LLM-driven chatbots or agents, which need to retrieve context for the LLM or interact with external tools by converting natural language into API calls.
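A minimal sketch of the dispatch side of function calling follows. The `get_welfare_info` tool is hypothetical; in a real deployment the tool schema would be passed to the model's API, and the model would return a structured request like the JSON shown here, which the application then executes.

```python
import json

# Hypothetical tool the LLM may choose to call. A real version would query
# an actual government data source rather than return a placeholder.
def get_welfare_info(city: str) -> dict:
    return {"city": city, "program": "低收入戶補助"}

# Registry mapping tool names (as exposed to the model) to functions.
TOOLS = {"get_welfare_info": get_welfare_info}

def dispatch(tool_call_json: str):
    """Execute the function the model selected and return its result."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulated model output requesting a tool call:
result = dispatch('{"name": "get_welfare_info", "arguments": {"city": "臺北市"}}')
```

The result would then be fed back into the conversation so the LLM can phrase a natural-language answer around the retrieved data.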
# Point yum at the CentOS vault archive (mirror.centos.org no longer serves EOL releases)
sed -i 's/mirror.centos.org/vault.centos.org/g' /etc/yum.repos.d/CentOS-*.repo
sed -i 's/^#.*baseurl=http/baseurl=http/g' /etc/yum.repos.d/CentOS-*.repo
sed -i 's/^mirrorlist=http/#mirrorlist=http/g' /etc/yum.repos.d/CentOS-*.repo
# Load MediaTek's Breeze-7B, an open model tuned for Traditional Chinese
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v1_0")
model = AutoModelForCausalLM.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v1_0")
$ ollama create --quantize q4_K_M my-breeze
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success