Academic Guidebook Chatbot: Performance Comparison of Fine-Tuned Mistral 7B and LLaMA-2 7B
Background: Chatbots have recently become a leading technological solution due to the high demand for fast and efficient information retrieval. This study therefore developed a local document-based chatbot that can answer questions about the contents of PDF documents using open-source AI models, namely Mistral 7B and LLaMA-2 7B. Although these models are effective at processing natural language, a major challenge is their tendency to generate hallucinated answers, i.e., responses that are inaccurate or out of context.
Objective: This study aims to reduce hallucinated responses from the chatbot models by making their answers more precise and accurate through fine-tuning. The performance of the fine-tuned Mistral 7B and LLaMA-2 7B models was also compared.
Methods: The two models were fine-tuned on a domain-specific dataset derived from the Academic Guidebook. This process was conducted to improve the models' ability to understand and answer questions relevant to the Academic Guidebook context. Performance was evaluated using the METEOR score to measure literal agreement and BERTScore to assess semantic agreement. In addition, response time was measured to assess efficiency, while the chatbot system was built with Streamlit and LangChain for real-time interaction.
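The abstract does not state the exact fine-tuning recipe; the sketch below shows one common, resource-friendly possibility, parameter-efficient LoRA fine-tuning with the Hugging Face transformers and peft libraries, in which the model names and all hyperparameters are illustrative assumptions rather than the study's actual configuration.

```python
# Hypothetical sketch: LoRA fine-tuning of Mistral 7B on guidebook Q&A data.
# The paper does not specify its training recipe; everything here is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # or "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small low-rank adapter matrices instead of all 7B weights.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
# ...train with transformers.Trainer on (question, answer) pairs drawn from
# the Academic Guidebook, then save the adapter:
# model.save_pretrained("mistral-7b-guidebook-lora")
```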
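As a companion illustration, the following minimal Streamlit script sketches the document-retrieval side of such a chatbot. The LangChain module paths assume a recent langchain-community release, and the PDF path, chunk sizes, and retrieval depth are hypothetical placeholders, not the study's settings.

```python
# Hypothetical sketch of a local PDF retrieval front end with Streamlit,
# LangChain, and a FAISS vector index.
import streamlit as st
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

@st.cache_resource  # build the index once per Streamlit session
def build_retriever(pdf_path: str):
    pages = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50).split_documents(pages)
    index = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
    return index.as_retriever(search_kwargs={"k": 4})

retriever = build_retriever("academic_guidebook.pdf")  # illustrative path
question = st.text_input("Ask about the Academic Guidebook")
if question:
    # The retrieved chunks would be passed as context to the fine-tuned
    # Mistral 7B / LLaMA-2 7B model; here we simply display them.
    for doc in retriever.invoke(question):
        st.write(doc.page_content)
```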
Results: The fine-tuned Mistral 7B model achieved the highest METEOR score of 0.40 and a BERTScore F1 of 0.78. In terms of efficiency, fine-tuned Mistral 7B responded faster than fine-tuned LLaMA-2 7B, while both models without fine-tuning showed longer response times than their fine-tuned counterparts.
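For readers unfamiliar with the two reported metrics, the following sketch shows how a METEOR score and a BERTScore F1 can be computed with the nltk and bert-score packages; the reference and candidate strings are invented examples, not items from the study's test set.

```python
# Hedged sketch: computing METEOR (literal agreement) and BERTScore F1
# (semantic agreement) for one illustrative answer pair.
import nltk
from nltk.translate.meteor_score import meteor_score
from bert_score import score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet synonym matching

reference = "Students must complete 144 credits to graduate."
candidate = "A student needs 144 credits in order to graduate."

# METEOR: surface-level overlap; NLTK expects pre-tokenized input.
meteor = meteor_score([reference.split()], candidate.split())

# BERTScore: similarity of contextual embeddings; the paper reports F1.
_, _, f1 = score([candidate], [reference], lang="en")

print(f"METEOR = {meteor:.2f}, BERTScore F1 = {f1.item():.2f}")
```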
Conclusion: The results show that fine-tuning significantly improved the performance of the large language models on this domain-specific task, reduced hallucinations, and enhanced response quality.
Keywords: Chatbot, Large Language Model, Mistral 7B, LLaMA-2 7B, METEOR Score