Academic Guidebook Chatbot: Performance Comparison of Fine-Tuned Mistral 7B and LLaMA-2 7B
Background: Chatbots have recently become a leading technological solution due to the high demand for fast and efficient information retrieval. This study therefore developed a local document-based chatbot that can answer questions about the contents of PDF documents using open-source AI models, namely Mistral 7B and LLaMA-2 7B. Although these models are effective at processing natural language, a major challenge is their tendency to generate hallucinated answers, i.e., responses that are inaccurate or out of context.
Objective: This study aims to reduce hallucinated responses from the chatbot models by making their answers more precise and accurate through fine-tuning. The performance of the fine-tuned Mistral 7B and LLaMA-2 7B models was also compared.
Methods: The two models were fine-tuned on a domain-specific dataset derived from the Academic Guidebook. This process was conducted to improve the models' ability to understand and answer questions relevant to the Academic Guidebook context. Performance was evaluated using the METEOR score to measure literal agreement and BERTScore to assess semantic agreement. In addition, response time was measured to assess efficiency, while the chatbot system was built with Streamlit and LangChain for real-time interaction.
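The abstract does not state the exact fine-tuning recipe; the sketch below shows one common, resource-friendly possibility, parameter-efficient LoRA fine-tuning with the Hugging Face transformers and peft libraries, in which the model names and all hyperparameters are illustrative assumptions rather than the study's actual configuration.

```python
# Hypothetical sketch: LoRA fine-tuning of Mistral 7B on guidebook Q&A data.
# The paper does not specify its training recipe; everything here is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # or "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small low-rank adapter matrices instead of all 7B weights.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
# ...train with transformers.Trainer on (question, answer) pairs drawn from
# the Academic Guidebook, then save the adapter:
# model.save_pretrained("mistral-7b-guidebook-lora")
```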
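As a companion illustration, the following minimal Streamlit script sketches the document-retrieval side of such a chatbot. The LangChain module paths assume a recent langchain-community release, and the PDF path, chunk sizes, and retrieval depth are hypothetical placeholders, not the study's settings.

```python
# Hypothetical sketch of a local PDF retrieval front end with Streamlit,
# LangChain, and a FAISS vector index.
import streamlit as st
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

@st.cache_resource  # build the index once per Streamlit session
def build_retriever(pdf_path: str):
    pages = PyPDFLoader(pdf_path).load()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50).split_documents(pages)
    index = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
    return index.as_retriever(search_kwargs={"k": 4})

retriever = build_retriever("academic_guidebook.pdf")  # illustrative path
question = st.text_input("Ask about the Academic Guidebook")
if question:
    # The retrieved chunks would be passed as context to the fine-tuned
    # Mistral 7B / LLaMA-2 7B model; here we simply display them.
    for doc in retriever.invoke(question):
        st.write(doc.page_content)
```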
Results: The fine-tuned Mistral 7B model achieved the highest METEOR score of 0.40 and a BERTScore F1 of 0.78. In terms of efficiency, fine-tuned Mistral 7B responded faster than fine-tuned LLaMA-2 7B, while both models without fine-tuning showed longer response times than their fine-tuned counterparts.
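For readers unfamiliar with the two reported metrics, the following sketch shows how a METEOR score and a BERTScore F1 can be computed with the nltk and bert-score packages; the reference and candidate strings are invented examples, not items from the study's test set.

```python
# Hedged sketch: computing METEOR (literal agreement) and BERTScore F1
# (semantic agreement) for one illustrative answer pair.
import nltk
from nltk.translate.meteor_score import meteor_score
from bert_score import score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet synonym matching

reference = "Students must complete 144 credits to graduate."
candidate = "A student needs 144 credits in order to graduate."

# METEOR: surface-level overlap; NLTK expects pre-tokenized input.
meteor = meteor_score([reference.split()], candidate.split())

# BERTScore: similarity of contextual embeddings; the paper reports F1.
_, _, f1 = score([candidate], [reference], lang="en")

print(f"METEOR = {meteor:.2f}, BERTScore F1 = {f1.item():.2f}")
```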
Conclusion: The results show that fine-tuning significantly improved the performance of the large language models on this domain-specific task, reduced hallucinations, and enhanced response quality.
Keywords: Chatbot, Large Language Model, Mistral 7B, LLaMA-2 7B, METEOR Score