UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

Bo Yang1, Qingping Yang2, Yingwei Ma2, Runtao Liu3,
1South China University of Technology, 2ReasonMind, 3Hong Kong University of Science and Technology
sdyangbo02 [at] mail [dot] scut [dot] edu [dot] cn
qingping95 [at] gmail [dot] com
yingwei [dot] ywma [at] gmail [dot] com
runtao219 [at] gmail [dot] com


UTMath is a cutting-edge and comprehensive benchmark designed to evaluate the mathematical reasoning abilities of Large Language Models (LLMs). It consists of 1,053 problems with an average of 68 test cases per problem, ensuring that models genuinely solve the problems rather than merely recall memorized answers.
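To make the unit-test idea concrete, here is a minimal Python sketch of checking a candidate solution against many sequence test cases. The field names and the toy sequence are illustrative assumptions, not the released dataset schema.

    # Illustrative only: a sequence-based problem with many (input, expected) test cases.
    # Field names are hypothetical and may differ from the released UTMath schema.
    problem = {
        "problem_id": "UTMath_example",
        "description": "a(n) is the n-th even positive integer.",
        "test_cases": [(n, 2 * n) for n in range(1, 69)],  # ~68 cases per problem
    }

    def check(solution, test_cases):
        """A solution passes only if it reproduces every test case."""
        return all(solution(n) == expected for n, expected in test_cases)

    # A memorized single answer cannot pass; a general rule is required.
    print(check(lambda n: 2 * n, problem["test_cases"]))  # True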

The Reasoning-to-Coding of Thoughts (RCoT) approach complements the UTMath benchmark by prompting LLMs to engage in explicit reasoning before generating code. RCoT significantly improves both the quality and the efficiency of the resulting solutions, suggesting that separating reasoning from coding encourages models to reason more critically and to find better algorithms.
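A minimal sketch of the two-stage idea, assuming a generic chat client; ask_llm is a hypothetical placeholder, not part of the released code. The model first produces explicit reasoning, and only then turns that reasoning into a general Python solution.

    # Hypothetical RCoT-style pipeline; ask_llm() stands in for any chat LLM call.
    def ask_llm(prompt: str) -> str:
        """Placeholder: plug in a real LLM client here."""
        raise NotImplementedError

    def rcot_solve(problem_description: str) -> str:
        # Stage 1: explicit mathematical reasoning, no code yet.
        reasoning = ask_llm(
            "Reason step by step about this sequence problem and derive the rule "
            "before writing any code:\n" + problem_description
        )
        # Stage 2: turn the reasoning into a general solution(n) function.
        return ask_llm(
            "Based on the reasoning below, write a Python function solution(n) "
            "returning the n-th term.\nReasoning:\n" + reasoning
        )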

  • ⚡️ Multiple Case Validation: Instead of single test cases that can be memorized, our problems are sequence-based, allowing numerous cases to validate true understanding.
  • 🔧 General Solutions: UTMath requires models to solve problems by generating code, aiming for general solutions rather than problem-specific ones, which aligns more closely with genuine intelligence.
  • 🏆 Enhanced Reasoning: Emphasizing reasoning lets large models focus on improving the quality of their reasoning, thereby delivering higher-quality and more efficient solutions.
  • 🌐 Modularity: By separating reasoning from implementation, the influence of coding ability on the assessment of reasoning is removed, providing a new paradigm for evaluating reasoning through the code a model generates.

[2024/11]🚀🚀🚀: We released our code.

[2024/11]🚀🚀🚀: We released our benchmark UTMath.

[2025/01]🚀🚀🚀: UTMath has already received over 390 downloads on Hugging Face.

💬 Citation

If you find our work interesting and meaningful, we welcome you to give our repo a 🌟 and cite our paper.

@article{yang2024utmath,
  title={UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts},
  author={Yang, Bo and Yang, Qingping and Liu, Runtao},
  journal={arXiv preprint arXiv:2411.07240},
  year={2024}
}

🥰 Acknowledgement

  • We sincerely thank the OEIS for its tireless efforts and contributions to the advancement of mathematics and computer science.
  • We are also grateful to HumanEval for providing valuable code resources.
Abstract

The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and generality. This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess LLMs through extensive unit tests, with a focus on both the accuracy and generality of model responses. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. UTMath is highly challenging: the best-performing model, o1-mini, solves only 32.57% of the problems, followed by o1-preview at 27.16% and GPT-4o at 26.93%. Furthermore, we present the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to engage in explicit reasoning prior to code generation, thereby facilitating the production of more sophisticated solutions and enhancing overall performance and efficiency. We also release the UTMath-Train training dataset (more than 70k samples) to support the community in further exploring mathematical reasoning.

🥇 Leaderboard

  • GPT-4o solves only 26.93% of the problems in our benchmark (the strongest result among the non-o1 models we evaluated), demonstrating the benchmark's difficulty.
  • Figure: Pass rate and average run time of LLMs on UTMath. We list the performance of eight large models using PoT (Program of Thoughts) and RCoT across a range of metrics. For o1-mini and o1-preview, only Pass@1 data is currently available due to resource constraints. The average run time is calculated based on the problems solved by the PoT or RCoT methods. Efficiency is defined as (Avg. Runtime(PoT) - Avg. Runtime(RCoT)) / Avg. Runtime(RCoT); a one-line check of this definition appears below.
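As a quick sanity check of the efficiency definition above, a one-line version follows; the runtimes in the example are made up, not leaderboard numbers.

    def efficiency_gain(avg_runtime_pot: float, avg_runtime_rcot: float) -> float:
        """Relative runtime improvement of RCoT over PoT, per the definition above."""
        return (avg_runtime_pot - avg_runtime_rcot) / avg_runtime_rcot

    # ~0.25: PoT solutions take about 25% longer than RCoT ones in this toy example.
    print(round(efficiency_gain(2.0, 1.6), 2))  # 0.25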

🚠 Generation Pipeline

  • The problems are derived from OEIS: 23,238 Principle Sequences were downloaded and cleaned, yielding 1,053 usable problems spanning nine mathematical domains with an average of 68 test cases per problem.
  • Figure: UTMath generation pipeline. After downloading 23,238 Principle Sequences from OEIS and cleaning the data, 1,053 usable sequences were obtained. Descriptions were standardized by adding background information and improving readability (highlighted in green in the figure). Hard cases were introduced to enhance discriminative capability, including terms from later positions so that simplistic algorithms cannot pass; a toy illustration of this idea follows.
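A rough illustration of the hard-case idea; the index values and counts below are invented for the example, while the actual selection procedure is described in the paper.

    # Illustrative only: mix easy (early) and hard (late) sequence positions as test inputs,
    # so solutions that merely brute-force early terms fail on the late ones.
    def build_test_indices(num_easy: int = 60, hard_positions=(10**3, 10**4, 10**5)):
        easy = list(range(1, num_easy + 1))   # early terms most programs can reach
        hard = list(hard_positions)           # late terms that defeat simplistic algorithms
        return easy + hard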

📋 Dataset Statistics

  • The benchmark comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem.
  • Figure: Comparison between UTMath and other benchmarks. UTMath offers a cutting-edge benchmark with a comprehensive set of 1,053 problems across multiple mathematical domains, providing a more accurate evaluation of LLMs' mathematical reasoning capabilities.

📖 Case Study

  • We present a qualitative case study of UTMath and RCoT.
  • Figure: GPT-4o solving UTMath_948 with the PoT method and with the RCoT method, respectively. PoT simply performs brute-force solving, while RCoT involves deeper reasoning: after a case analysis, the cases are merged and Euler's formula is applied, yielding a solution with lower time complexity. An illustrative (not problem-specific) contrast of the two styles follows.
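The problem itself is not reproduced here; as a generic stand-in for the brute-force-versus-reasoned contrast, compare a naive totient computation with the version based on Euler's product formula (an illustrative analogy, not the actual UTMath_948 solution).

    # Illustrative contrast only (not the actual UTMath_948 problem).
    from math import gcd

    def phi_bruteforce(n: int) -> int:
        """Brute force, O(n log n): count integers in [1, n] coprime to n."""
        return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

    def phi_formula(n: int) -> int:
        """~O(sqrt(n)): Euler's product formula phi(n) = n * prod(1 - 1/p) over prime divisors p."""
        result, m, p = n, n, 2
        while p * p <= m:
            if m % p == 0:
                while m % p == 0:
                    m //= p
                result -= result // p
            p += 1
        if m > 1:
            result -= result // m
        return result

    assert phi_bruteforce(360) == phi_formula(360) == 96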

😎 Some interesting findings

We conducted a comprehensive study with 8 LLMs. Some of our key findings are summarized as follows:

  • Modern LLMs perform poorly in Graph Theory, Group Theory, Geometry and Topology.
  • Figure: Performance on different problem categories (%). Categories are represented by abbreviations. NT: Number Theory; T.: Theory; DM: Discrete Mathematics; CM: Combinatorial Mathematics; GT: Geometry and Topology; PSE: Polynomial and Series Expansions; SN: Special Numbers; FL: Formal Languages.

  • RCoT can significantly improve the pass@k performance of LLMs. With RCoT, 7 of 8 evaluated LLMs generated more efficient solutions, with most models achieving higher scores.
  • Figure: Performance comparison of models across PoT and RCoT tasks at different pass@k levels.
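pass@k here is presumably the standard unbiased estimator popularized by the HumanEval evaluation; the snippet below is our restatement of that formula under this assumption, not code from the UTMath repo.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n generated samples, c of which pass all unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(round(pass_at_k(n=10, c=3, k=1), 2))  # 0.3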

  • The quality of reasoning significantly impacts the accuracy and efficiency of the model's final solution.
  • Figure: Performance comparison between self-reasoning and using GPT-4o's reasoning for coding across different models. The results show that models perform better when relying on GPT-4o's reasoning output.
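A rough sketch of how such a swap can be wired up, reusing the two-stage split from RCoT; the function below is a hypothetical illustration of the setup, not the paper's evaluation code.

    # Hypothetical: reasoning produced by one model, code produced by another.
    def cross_model_solve(problem: str, reasoner, coder) -> str:
        reasoning = reasoner("Reason about this sequence problem:\n" + problem)
        return coder("Write a Python solution(n) based on this reasoning:\n" + reasoning)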

We hope our findings contribute to a deeper understanding of the current reasoning abilities of LLMs and to the further development of models.