Publications

A collection of my research work.

EvoLM: In Search of Lost Language Model Training Dynamics

๐Ÿ† Oral Presentation

Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric Xing, Sham Kakade, Hanlin Zhang

Advances in Neural Information Processing Systems (NeurIPS) 2025

We developed a comprehensive model suite for analyzing language model training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning stages.

arXiv · PDF

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Zidi Xiong, Shan Chen, Zhenting Qi, Himabindu Lakkaraju

Advances in Neural Information Processing Systems (NeurIPS) 2025

We introduced a systematic framework to evaluate the faithfulness of thinking drafts in Large Reasoning Models using counterfactual interventions.

arXiv · PDF

Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Maohao Shen†, Guangtao Zeng†, Zhenting Qi†, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan

International Conference on Machine Learning (ICML) 2025

We introduced the COAT reasoning framework to enhance LLM reasoning via autoregressive search with self-reflection and self-exploration.

arXiv · PDF

rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Zhenting Qi†, Mingyuan Ma†, Jiahang Xu†, Li Lyna Zhang, Fan Yang, Mao Yang

International Conference on Learning Representations (ICLR) 2025

We introduced rStar, a self-play mutual reasoning approach that enhances the reasoning capabilities of small language models without fine-tuning or reliance on stronger models.

arXiv · PDF

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju

International Conference on Learning Representations (ICLR) 2025

We developed a scalable method for extracting data from RAG systems using LLMs' instruction-following capabilities.

arXiv · PDF

Quantifying Generalization Complexity for Large Language Models

Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass

International Conference on Learning Representations (ICLR) 2025

We introduced Scylla, a dynamic evaluation framework that quantitatively measures LLMs' generalization abilities by disentangling generalization from memorization.

arXiv · PDF

FOLIO: Natural Language Reasoning with First-Order Logic

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, Dragomir Radev

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024

We developed a comprehensive dataset and benchmark for natural language reasoning using First-Order Logic.

arXiv · PDF

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024

We extended FOLIO with abundant human-written reasoning chains, providing detailed reasoning processes for logical reasoning tasks.

arXiv · PDF

Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge

Weihua Du, Qiushi Lyu, Jiaming Shan, Zhenting Qi, Hongxin Zhang, Sunli Chen, Andi Peng, Tianmin Shu, Kwonjoon Lee, Behzad Dariush, Chuang Gan

Advances in Neural Information Processing Systems (NeurIPS) 2024

We introduced a comprehensive benchmark challenge for advancing research in embodied social intelligence through constrained human-AI cooperation scenarios.

arXiv · PDF

PILLOW: Enhancing Efficient Instruction Fine-tuning via Prompt Matching

๐Ÿ† Oral Presentation

Zhenting Qi, Xiaoyu Tan, Shaojie Shi, Chao Qu, Yinghui Xu, Yuan Qi

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023

We introduced a prompt matching framework to enhance the efficiency of instruction fine-tuning.

arXiv · PDF

QTSumm: A New Benchmark for Query-Focused Table Summarization

Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Xiangru Tang, Yumo Xu, Arman Cohan, Dragomir Radev

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023

We introduced a comprehensive benchmark dataset for query-focused table summarization.

arXiv · PDF

Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness

๐Ÿ† Oral Presentation

Xiaoyu Tan, Shaojie Shi, Xihe Qiu, Chao Qu, Zhenting Qi, Yinghui Xu, Yuan Qi

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023

We introduced a self-criticism framework that enables models to evaluate and improve their own outputs based on their understanding of helpfulness, honesty, and harmlessness.

OpenRT: An Open-source Framework for Reasoning Over Tabular Data

Yilun Zhao, Boyu Mi, Zhenting Qi, Linyong Nan, Minghao Guo, Arman Cohan, Dragomir Radev

Annual Meeting of the Association for Computational Linguistics (ACL) 2023

We developed and released an open-source framework for reasoning over tabular data.

RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations

Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, Dragomir Radev

Annual Meeting of the Association for Computational Linguistics (ACL) 2023

We conducted a systematic study of table QA robustness against human-annotated adversarial perturbations.

arXiv · PDF

SaFER: A Robust and Efficient Framework for Fine-tuning BERT-based Classifier with Noisy Labels

Zhenting Qi, Xiaoyu Tan, Chao Qu, Yinghui Xu, Yuan Qi

Annual Meeting of the Association for Computational Linguistics (ACL) 2023

We developed a robust framework for fine-tuning BERT-based classifiers in the presence of noisy labels.

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control

๐Ÿ† Oral Presentation

Yilun Zhao, Zhenting Qi, Linyong Nan, Lorenzo Jaime Flores, Dragomir Radev

Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2023

We introduced logic form control mechanisms to guide table-to-text generation and ensure faithfulness to source data.

arXiv · PDF

ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples

Yilun Zhao, Linyong Nan, Zhenting Qi, Rui Zhang, Dragomir Radev

Conference on Empirical Methods in Natural Language Processing (EMNLP) 2022

We developed methods to generate synthetic reasoning examples for table understanding tasks, injecting table reasoning skills into models during the pre-training phase.

arXiv · PDF