Hello! I'm Zhenting Qi

I am a first-year Master's student in Computational Science and Engineering at Harvard University. Prior to Harvard, I received dual bachelor's degrees in Computer Engineering from UIUC and ZJU with highest honors.

Previously, I worked as a research intern at Yale (2022-2023) with Professor Dragomir R. Radev, and at UIUC (2022) with Professor Volodymyr Kindratenko.

I am open to research or internship opportunities! Please contact me at zhentingqi[AT]g[DOT]harvard[DOT]edu.


Research

My research interests lie in Natural Language Processing (NLP). My long-term research goal is to build intelligent and reliable AI systems for the benefit of human society. Motivated by this goal, I am currently interested in the following topics (in no particular order):

1) Reasoning: Why do current reasoning paradigms work? How can we design AIs to perform human-like reasoning?

2) Safety: How do we enhance the controllability, reliability, and robustness of AI? How can we design scalable methods to ensure AI safety while improving AI capabilities?

For more information, please see my Google Scholar, Semantic Scholar, and DBLP profiles.

News


Publications (Selected)

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models (DPFM)

We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection.

PILLOW: Enhancing Efficient Instruction Fine-tuning via Prompt Matching

EMNLP 2023, Industry Track Oral Presentation

We improve LoRA-finetuned LLMs with a prompt matching framework, achieving performance on par with full SFT.

QTSumm: A New Benchmark for Query-Focused Table Summarization

EMNLP 2023

We introduce a new benchmark named QTSUMM for query-focused table summarization, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics.

SaFER: A Robust and Efficient Framework for Finetuning BERT-based Classifier with Noisy Labels

ACL 2023, Industry Track

We propose a robust and efficient fine-tuning framework for BERT-based text classifiers that combats label noise without access to any clean data for training or validation.

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control

EACL 2023, Short Paper Oral Presentation

LoFT utilizes logic forms as fact verifiers and content planners to control logical table-to-text generation.

FOLIO: Natural Language Reasoning with First-Order Logic

Preprint 2022

We present FOLIO, a human-annotated, open-domain, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations.


Industry

Research Assistant in LLMs, Microsoft Research Asia (Beijing)

Oct 2023 - Present

Researched LLM reasoning.

Research Intern in AI Algorithms, INF Technology (Shanghai)

Feb 2022 - Jun 2023

Researched LLMs in industrial applications.


Miscellaneous

I enjoy playing basketball, reading, cooking, and watching movies, and I am a huge fan of guitar and music!