Leaderboard

We maintain a leaderboard that tracks model performance on our MANGO benchmark. We report the success rate on Destination Finding (DF) and Route Finding (RF) questions, each under easy and hard settings. Evaluation covers all 53 mazes, and the reported score is the accuracy averaged over mazes.
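The scoring described above amounts to a per-maze success rate averaged across mazes. The snippet below is a minimal sketch of that aggregation (the function and variable names are illustrative, not the official MANGO evaluation code):

    # Sketch of the assumed aggregation: per-maze success rate, averaged over mazes.
    from statistics import mean

    def maze_success_rate(results):
        """results: list of booleans, one per question in a single maze."""
        return sum(results) / len(results)

    def benchmark_score(per_maze_results):
        """per_maze_results: dict mapping maze id -> list of per-question booleans."""
        return mean(maze_success_rate(r) for r in per_maze_results.values())

    # Example with two hypothetical mazes for one task/difficulty (e.g. DF, easy).
    score = benchmark_score({
        "maze_01": [True, True, False],
        "maze_02": [True, False, False, False],
    })
    print(f"average accuracy: {score:.2f}")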

Submission

We welcome new submissions to our MANGO benchmark. To submit a new result, please email dingpeng@uchicago.edu with the paper describing your method. We will then update the leaderboard and link to your paper accordingly.

🏆 Benchmark 🏆

Rank   Model                DF (easy)   DF (hard)   RF (easy)   RF (hard)
1 🥇   GPT-4-0613           0.83        0.58        0.55        0.45
2 🥈   Claude-2             0.81        0.45        0.47        0.19
3 🥉   Claude-1             0.72        0.36        0.33        0.11
4      GPT-3.5-turbo-0613   0.57        0.32        0.15        0.03
5      Llama-2              0.41        0.24        0.03        0.00
6      RWKV                 0.19        0.20        0.01        0.00

Updated on: 2024-03-03