Leaderboard
We host a leaderboard for evaluating model performance on our MANGO benchmark. We measure the success rate on Destination Finding (DF) and Route Finding (RF) questions, each with an easy and a hard setting. Evaluation is performed on all 53 mazes, and we report the average accuracy across them.
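As a rough illustration of the metric, the numbers below can be read as per-maze success rates macro-averaged over the 53 mazes. The helper and data structure here are hypothetical, not the benchmark's actual evaluation code:

```python
def average_accuracy(per_maze_successes):
    """Macro-average of per-maze success rates.

    per_maze_successes: dict mapping maze name -> list of 0/1 question
    outcomes. Each maze contributes its own success rate; the final score
    is the unweighted mean over mazes, so small mazes count as much as
    large ones. (Illustrative sketch, not the official scorer.)
    """
    rates = [sum(v) / len(v) for v in per_maze_successes.values()]
    return sum(rates) / len(rates)

# Toy example with two mazes instead of 53:
results = {
    "maze_a": [1, 0, 1, 1],  # 3/4 questions solved -> 0.75
    "maze_b": [0, 1],        # 1/2 questions solved -> 0.50
}
print(average_accuracy(results))  # → 0.625
```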
Submission
We welcome new submissions to our MANGO benchmark. To submit a new result, please email dingpeng@uchicago.edu with the paper describing your method. We will then update the leaderboard and link to your paper accordingly.
🏆 Benchmark 🏆
Rank | Model | DF (easy) | DF (hard) | RF (easy) | RF (hard) |
---|---|---|---|---|---|
1 🥇 | GPT-4-0613 | 0.83 | 0.58 | 0.55 | 0.45 |
2 🥈 | Claude-2 | 0.81 | 0.45 | 0.47 | 0.19 |
3 🥉 | Claude-1 | 0.72 | 0.36 | 0.33 | 0.11 |
4 | GPT-3.5-turbo-0613 | 0.57 | 0.32 | 0.15 | 0.03 |
5 | Llama-2 | 0.41 | 0.24 | 0.03 | 0.00 |
6 | RWKV | 0.19 | 0.20 | 0.01 | 0.00 |
Updated on: 2024-03-03