Roxas D. Q., Emekci H.
JOURNAL OF POLYTECHNIC-POLITEKNIK DERGISI, vol.29, no.1, pp.1-20, 2026 (ESCI, TRDizin)
Abstract
The growing use of Large Language Models (LLMs) in healthcare raises important questions about the need for domain-specific training in medical applications. This study presents a detailed evaluation of medical-domain and general-purpose LLMs using three medical datasets (PubMedQA, BioASQ, and WikiDoc), which together contain approximately 11,000 question-answer pairs. We evaluated four medical-domain models (Meditron-7B, BioMistral-7B, MedAlpaca-13B, and PMC-LLaMA-13B) against four general-purpose instruction-tuned models (Ministral-8B-Instruct, Gemma-2-9B-it, Vicuna-13B-v1.5, and Llama-3-8B-Instruct). Across 182,944 prompts in both zero-shot and few-shot settings, our findings show that general-purpose models consistently outperformed their medical-specific counterparts on all evaluation metrics. Specifically, Ministral-8B-Instruct achieved the highest performance in few-shot settings with a BERTScore of 0.613, a SimCSE score of 0.764, and a semantic similarity of 0.684. These scores were significantly higher than those of the best medical model, BioMistral-7B (0.545, 0.678, and 0.533, respectively). Furthermore, zero-shot performance often matched or surpassed few-shot results, as seen with Llama-3-8B-Instruct achieving a SimCSE score of 0.794. These findings challenge the common assumption that domain-specific pretraining is required for optimal performance in specialized tasks, and they have major implications for how resources are allocated in healthcare AI development.
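The embedding-based metrics named in the abstract (SimCSE score, semantic similarity) ultimately reduce to a cosine similarity between the embedding of a model's answer and the embedding of the reference answer. As a minimal sketch, assuming sentence embeddings have already been produced by an encoder such as SimCSE (the toy vectors below are placeholders, not real encoder output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for sentence-encoder output:
# one reference answer and one model-generated answer.
ref_embedding = np.array([0.20, 0.70, 0.10])
hyp_embedding = np.array([0.25, 0.65, 0.05])

# A higher score indicates closer semantic agreement with the reference.
score = cosine_similarity(ref_embedding, hyp_embedding)
```

In a full evaluation pipeline, this per-pair score would be averaged over all question-answer pairs in a dataset to yield a single figure comparable to the values reported above.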