Researchers Unveil Igaku QA Benchmark, Highlighting Both Potential and Limitations of Large Language Models in Non-English Languages
In a recent study, a team of researchers evaluated Large Language Model (LLM) APIs, including GPT-4, GPT-3, and ChatGPT, on Japanese national medical licensing examinations from the past five years. The results demonstrated that GPT-4 outperforms its predecessors, passing all five years of the exams and showcasing the potential of LLMs in languages other than English.
Taken together, the findings point to the promise of LLMs in languages beyond English, and to the need for further research and development to address their limitations.
The research team, made up of native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan, collected exam problems and answers from the past five years (2018-2022) to create the Igaku QA benchmark. The benchmark is distinctive in that it relies solely on resources originally written in Japanese, rather than on material translated from other languages or drawn from other countries.
One notable finding is that ChatGPT-EN, a variant based on explicit translation, outperforms standard ChatGPT in most cases. This suggests that the multilingual abilities of LLMs fall short when translation is not handled explicitly.
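The comparison between a direct setup and a translation-based one can be illustrated with a minimal sketch. It assumes that "explicit translation" means each question is translated before the model answers it; the article does not spell out the details. The names `translate`, `answer`, `translate_first`, and `compare`, as well as the (question, gold answer) record layout, are hypothetical and are not taken from the study or from any particular API.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical record layout: (question text, gold answer label).
Problem = Tuple[str, str]


def accuracy(problems: Sequence[Problem], answer: Callable[[str], str]) -> float:
    """Return the fraction of problems whose predicted answer matches the gold answer."""
    if not problems:
        return 0.0
    correct = sum(1 for question, gold in problems if answer(question) == gold)
    return correct / len(problems)


def translate_first(
    translate: Callable[[str], str],
    answer: Callable[[str], str],
) -> Callable[[str], str]:
    """Wrap an answerer so every question is translated before it is answered."""
    def pipeline(question: str) -> str:
        return answer(translate(question))
    return pipeline


def compare(
    problems: Sequence[Problem],
    translate: Callable[[str], str],
    answer: Callable[[str], str],
) -> Dict[str, float]:
    """Score the same answerer with and without an explicit translation step."""
    return {
        "direct": accuracy(problems, answer),
        "translate_first": accuracy(problems, translate_first(translate, answer)),
    }
```

In practice, `translate` and `answer` would wrap calls to whichever model is being evaluated. Keeping them as plain callables means the same scoring code can be applied to both setups.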
The study also raises concerns about the ethical implications of LLMs in specialized domains. In some instances, the models selected prohibited choices that are strictly avoided in medical practice in Japan, such as recommending euthanasia. This underlines the importance of addressing these limitations and enhancing the safety of LLMs for real-world applications.
By releasing the Igaku QA benchmark, model outputs, and meta-information, the research team hopes to foster progress in the development and application of LLMs for diverse languages. If these limitations are addressed through further research, LLMs have the potential to revolutionize language understanding and processing across a wide range of domains, including specialized areas like medicine.