GPT-4 Surpasses ChatGPT and GPT-3 on Japanese Medical Licensing Examinations

April 3, 2023

Researchers Unveil Igaku QA Benchmark, Highlighting Both Potential and Limitations of Large Language Models in Non-English Languages

In a recent study, a team of researchers evaluated Large Language Model (LLM) APIs, including GPT-4, GPT-3, and ChatGPT, on Japanese national medical licensing examinations from the past five years. The results demonstrated that GPT-4 outperforms its predecessors, passing all five years of the exams and showcasing the potential of LLMs in languages other than English.

The study highlights the potential of LLMs in languages beyond English, while also emphasizing the need for further research and development to address their limitations.

The research team, consisting of native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan, collected exam problems and answers from five years of examinations (2018-2022) and created the Igaku QA benchmark. The benchmark is distinctive in that it relies solely on resources originally written in Japanese, with no material translated from other languages or countries.

One notable finding is that ChatGPT-EN, which is based on explicit translation, outperforms the standard ChatGPT in most cases. This suggests that the multilingual abilities of LLMs remain limited when translation is not handled explicitly.

The study also raises concerns about the ethical implications of LLMs in specialized domains. In some instances, the models selected prohibited choices that are strictly avoided in medical practice in Japan, such as recommending euthanasia. This underlines the importance of addressing these limitations and enhancing the safety of LLMs for real-world applications.

As the research team releases the Igaku QA benchmark, model outputs, and meta-information, they hope to foster progress in the development and application of LLMs for diverse languages. By addressing the limitations and furthering research in the field, LLMs have the potential to revolutionize language understanding and processing across a wide range of domains, including specialized areas like medicine.

Paper: https://arxiv.org/abs/2303.18027

GitHub: https://github.com/jungokasai/IgakuQA
