Publications tagged "agentic-ai"
-
Preprint
arXiv preprint arXiv:2605.20086, 2026
Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model’s internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.
@article{pelleriti2026what,
  title = {What Do Evolutionary Coding Agents Evolve?},
  author = {Pelleriti, Nico and Nelaturu, Sree Harsha and Zhou, Zhanke and Li, Zongze and Zimmer, Max and Han, Bo and Pokutta, Sebastian},
  journal = {arXiv preprint arXiv:2605.20086},
  year = {2026},
}
-
Workshop
ICML 2026 Workshop: AI as a Tool for Mathematics, Computer Science, and Machine Learning (Oral presentation), 2026
AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI-assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five-level taxonomy of AI integration, (II) an open-source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal-laptop prototyping to multi-node, multi-GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at https://github.com/ZIB-IOL/The-Agentic-Researcher.
@inproceedings{zimmer2026agentic,
  title = {The Agentic Researcher: A Practical Guide to {AI}-Assisted Research in Mathematics and Machine Learning},
  author = {Zimmer, Max and Pelleriti, Nico and Roux, Christophe and Pokutta, Sebastian},
  booktitle = {ICML 2026 Workshop: AI as a Tool for Mathematics, Computer Science, and Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=vpcw03stJR},
}
-
Z. Zhou, Z. Li, W. Huang, X. Li, C. Cao, and 14 more authors
Workshop
ICML 2026 Workshop: AI as a Tool for Mathematics, Computer Science, and Machine Learning, 2026
Open agents are no longer just models: they are model-harness systems whose behavior depends on tool access, control loops, execution feedback, memory, etc. Yet existing evaluations either test models without a harness or fix a single harness to compare models within a domain, obscuring how harness design shapes reasoning capability. To fill this gap, we introduce AlphaDiana, a unified system for harness-aware evaluation of open agents on verifiable reasoning tasks. AlphaDiana standardizes models, harnesses, benchmarks, execution environments, scorers, budgets, and trajectory logging, enabling controlled comparisons across both models and harnesses. We use AlphaDiana to evaluate open agents across mathematical, scientific, coding, terminal, and multimodal reasoning tasks, combining macro-level comparisons of direct inference and agentic execution with micro-level ablations of core harness capabilities. Trajectory-level analysis attributes successes and failures to reasoning, tool use, execution, state, budget, recovery, and verification, revealing that harnesses can both enable iterative problem-solving and introduce systematic errors. AlphaDiana moves agent evaluation beyond asking whether a system succeeds toward explaining why a model-harness system succeeds or fails.
@inproceedings{zhou2026reasoning,
  title = {Reasoning Is More Than the Model: Harness-Aware Evaluation of Agents on Verifiable Reasoning Tasks},
  author = {Zhou, Zhanke and Li, Zongze and Huang, Weikai and Li, Xuan and Cao, Chentao and Feng, Xiao and Lu, Xiangyu and Hu, Jinbo and Lu, Menghan and Xie, Yi and Pelleriti, Nico and Liu, Shiyang and Zimmer, Max and Miranda, Brando and Yao, Jiangchao and Liu, Bo and Koyejo, Sanmi and Pokutta, Sebastian and Han, Bo},
  booktitle = {ICML 2026 Workshop: AI as a Tool for Mathematics, Computer Science, and Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=fnHhEf0cSE},
}