UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

Bo Yang1, Qingping Yang2, Yingwei Ma2, Runtao Liu3,
1South China University of Technology, 2ReasonMind, 3Hong Kong University of Science and Technology
sdyangbo02 [at] mail [dot] scut [dot] edu [dot] cn
qingping95 [at] gmail [dot] com
yingwei [dot] ywma [at] gmail [dot] com
runtao219 [at] gmail [dot] com


UTMath is a cutting-edge and comprehensive benchmark designed to evaluate the mathematical reasoning abilities of Large Language Models (LLMs). It consists of 1,053 problems with an average of 68 test cases per problem, ensuring that models genuinely solve the problems rather than merely recall memorized answers.
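To make the unit-test idea concrete, here is a minimal Python sketch of checking a candidate solution against many sequence test cases. The field names and the toy sequence are illustrative assumptions, not the released dataset schema.

    # Illustrative only: a sequence-based problem with many (input, expected) test cases.
    # Field names are hypothetical and may differ from the released UTMath schema.
    problem = {
        "problem_id": "UTMath_example",
        "description": "a(n) is the n-th even positive integer.",
        "test_cases": [(n, 2 * n) for n in range(1, 69)],  # ~68 cases per problem
    }

    def check(solution, test_cases):
        """A solution passes only if it reproduces every test case."""
        return all(solution(n) == expected for n, expected in test_cases)

    # A memorized single answer cannot pass; a general rule is required.
    print(check(lambda n: 2 * n, problem["test_cases"]))  # True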

The Reasoning-to-Coding of Thoughts (RCoT) approach complements the UTMath benchmark by prompting LLMs to engage in explicit reasoning before generating code. RCoT significantly improves both the quality and the efficiency of the resulting solutions, suggesting that separating reasoning from coding encourages models to reason more critically and to find better algorithms.
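A minimal sketch of the two-stage idea, assuming a generic chat client; ask_llm is a hypothetical placeholder, not part of the released code. The model first produces explicit reasoning, and only then turns that reasoning into a general Python solution.

    # Hypothetical RCoT-style pipeline; ask_llm() stands in for any chat LLM call.
    def ask_llm(prompt: str) -> str:
        """Placeholder: plug in a real LLM client here."""
        raise NotImplementedError

    def rcot_solve(problem_description: str) -> str:
        # Stage 1: explicit mathematical reasoning, no code yet.
        reasoning = ask_llm(
            "Reason step by step about this sequence problem and derive the rule "
            "before writing any code:\n" + problem_description
        )
        # Stage 2: turn the reasoning into a general solution(n) function.
        return ask_llm(
            "Based on the reasoning below, write a Python function solution(n) "
            "returning the n-th term.\nReasoning:\n" + reasoning
        )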

  • ⚡️ Multiple Case Validation: Instead of single test cases that can be memorized, our problems are sequence-based, allowing numerous cases to validate true understanding.
  • 🔧 General Solutions: UTMath requires models to solve problems by generating code, aiming for general solutions rather than problem-specific ones, which aligns more closely with genuine intelligence.
  • 🏆 Enhanced Reasoning: Emphasizing reasoning lets large models focus on improving the quality of their reasoning, thereby delivering higher-quality and more efficient solutions.
  • 🌐 Modularity: By separating reasoning from implementation, the influence of coding ability on the assessment of reasoning is removed, providing a new paradigm for evaluating reasoning through the code a model generates.

[2024/11]🚀🚀🚀: We released our code.

[2024/11]🚀🚀🚀: We released our benchmark UTMath.

[2025/01]🚀🚀🚀: UTMath has already received over 390 downloads on Hugging Face.

💬 Citation

If you find our work interesting and meaningful, we welcome you to give our repo a 🌟 and cite our paper.

@article{yang2024utmath,
  title={UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts},
  author={Yang, Bo and Yang, Qingping and Liu, Runtao},
  journal={arXiv preprint arXiv:2411.07240},
  year={2024}
}

🥰 Acknowledgement

  • We sincerely thank the OEIS for its tireless efforts and contributions to the advancement of mathematics and computer science.
  • We are also grateful to HumanEval for providing valuable code resources.
Abstract

The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and generality. This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess LLMs through extensive unit tests, with a focus on both the accuracy and generality of model responses. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. UTMath is highly challenging: the best-performing model, o1-mini, solves only 32.57% of the problems, followed by o1-preview at 27.16% and GPT-4o at 26.93%. Furthermore, we present the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to engage in explicit reasoning prior to code generation, thereby facilitating the production of more sophisticated solutions and enhancing overall performance and efficiency. We also release the UTMath-Train training dataset (more than 70k samples) to support the community in further exploring mathematical reasoning.

🥇 Leaderboard

  • GPT-4o solves only 26.93% of the problems in our benchmark (the strongest result among the non-o1 models we evaluated), demonstrating the benchmark's difficulty.
  • Figure: Pass rate and average run time of LLMs on UTMath. We list the performance of eight large models using PoT (Program of Thoughts) and RCoT across a range of metrics. For o1-mini and o1-preview, only Pass@1 data is currently available due to resource constraints. The average run time is calculated based on the problems solved by the PoT or RCoT methods. Efficiency is defined as (Avg. Runtime(PoT) - Avg. Runtime(RCoT)) / Avg. Runtime(RCoT); a one-line check of this definition appears below.
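As a quick sanity check of the efficiency definition above, a one-line version follows; the runtimes in the example are made up, not leaderboard numbers.

    def efficiency_gain(avg_runtime_pot: float, avg_runtime_rcot: float) -> float:
        """Relative runtime improvement of RCoT over PoT, per the definition above."""
        return (avg_runtime_pot - avg_runtime_rcot) / avg_runtime_rcot

    # ~0.25: PoT solutions take about 25% longer than RCoT ones in this toy example.
    print(round(efficiency_gain(2.0, 1.6), 2))  # 0.25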

🚠 Generation Pipeline

  • The problems are derived from OEIS: 23,238 Principle Sequences were downloaded and cleaned, yielding 1,053 usable problems spanning nine mathematical domains with an average of 68 test cases per problem.
  • Figure: UTMath generation pipeline. After downloading 23,238 Principle Sequences from OEIS and cleaning the data, 1,053 usable sequences were obtained. Descriptions were standardized by adding background information and improving readability (highlighted in green in the figure). Hard cases were introduced to enhance discriminative capability, including terms from later positions so that simplistic algorithms cannot pass; a toy illustration of this idea follows.
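A rough illustration of the hard-case idea; the index values and counts below are invented for the example, while the actual selection procedure is described in the paper.

    # Illustrative only: mix easy (early) and hard (late) sequence positions as test inputs,
    # so solutions that merely brute-force early terms fail on the late ones.
    def build_test_indices(num_easy: int = 60, hard_positions=(10**3, 10**4, 10**5)):
        easy = list(range(1, num_easy + 1))   # early terms most programs can reach
        hard = list(hard_positions)           # late terms that defeat simplistic algorithms
        return easy + hard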

📋 Dataset Statistics

  • The benchmark comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem.
  • Figure: Comparison between UTMath and other benchmarks. UTMath offers a cutting-edge benchmark with a comprehensive set of 1,053 problems across multiple mathematical domains, providing a more accurate evaluation of LLMs' mathematical reasoning capabilities.

📖 Case Study

  • We present a qualitative case study of UTMath and RCoT.
  • Figure: GPT-4o solving UTMath_948 with the PoT method and with the RCoT method, respectively. PoT simply performs brute-force solving, while RCoT involves deeper reasoning: after a case analysis, the cases are merged and Euler's formula is applied, yielding a solution with lower time complexity. An illustrative (not problem-specific) contrast of the two styles follows.
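The problem itself is not reproduced here; as a generic stand-in for the brute-force-versus-reasoned contrast, compare a naive totient computation with the version based on Euler's product formula (an illustrative analogy, not the actual UTMath_948 solution).

    # Illustrative contrast only (not the actual UTMath_948 problem).
    from math import gcd

    def phi_bruteforce(n: int) -> int:
        """Brute force, O(n log n): count integers in [1, n] coprime to n."""
        return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

    def phi_formula(n: int) -> int:
        """~O(sqrt(n)): Euler's product formula phi(n) = n * prod(1 - 1/p) over prime divisors p."""
        result, m, p = n, n, 2
        while p * p <= m:
            if m % p == 0:
                while m % p == 0:
                    m //= p
                result -= result // p
            p += 1
        if m > 1:
            result -= result // m
        return result

    assert phi_bruteforce(360) == phi_formula(360) == 96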

😎 Some interesting findings

We conducted a comprehensive study with 8 LLMs. Some of our key findings are summarized as follows:

  • Modern LLMs perform poorly in Graph Theory, Group Theory, Geometry and Topology.
  • Figure: Performance on different problem categories (%). Categories are represented by abbreviations. NT: Number Theory; T.: Theory; DM: Discrete Mathematics; CM: Combinatorial Mathematics; GT: Geometry and Topology; PSE: Polynomial and Series Expansions; SN: Special Numbers; FL: Formal Languages.

  • RCoT can significantly improve the pass@k performance of LLMs. With RCoT, 7 of 8 evaluated LLMs generated more efficient solutions, with most models achieving higher scores.
  • Figure: Performance comparison of models across PoT and RCoT tasks at different pass@k levels.
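pass@k here is presumably the standard unbiased estimator popularized by the HumanEval evaluation; the snippet below is our restatement of that formula under this assumption, not code from the UTMath repo.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n generated samples, c of which pass all unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(round(pass_at_k(n=10, c=3, k=1), 2))  # 0.3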

  • The quality of reasoning significantly impacts the accuracy and efficiency of the model's final solution.
  • Figure: Performance comparison between self-reasoning and using GPT-4o's reasoning for coding across different models. The results show that models perform better when relying on GPT-4o's reasoning output.
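A rough sketch of how such a swap can be wired up, reusing the two-stage split from RCoT; the function below is a hypothetical illustration of the setup, not the paper's evaluation code.

    # Hypothetical: reasoning produced by one model, code produced by another.
    def cross_model_solve(problem: str, reasoner, coder) -> str:
        reasoning = reasoner("Reason about this sequence problem:\n" + problem)
        return coder("Write a Python solution(n) based on this reasoning:\n" + reasoning)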

We hope our findings contribute to a deeper understanding of the current reasoning abilities of LLMs and to the further development of models.