Hello! I'm Zhenting Qi

I am a second-year Master's student in Computational Science and Engineering at Harvard University. Prior to Harvard, I received a dual bachelor's degree in Computer Engineering from UIUC and ZJU with highest honors.

I am currently a member of the AI4LIFE group at Harvard, advised by Prof. Hima Lakkaraju. Previously, I was a student researcher at Yale (2022-2023), advised by Prof. Dragomir R. Radev, and at UIUC (2022), advised by Prof. Volodymyr Kindratenko. I also interned at Microsoft Research Asia (2023-2024), advised by Dr. Li Lyna Zhang.

I am open to research or internship opportunities! Please contact me at zhentingqi[AT]g[DOT]harvard[DOT]edu.


Research

My research interests lie broadly in Language Modeling and Generative AI. My long-term research goal is to build intelligent and reliable AI systems for the benefit of human society. Motivated by this goal, I am currently interested in the following topics (in no particular order):

1) Reasoning: Why do current reasoning paradigms work, and how do we improve them? How can we design AI systems that perform robust reasoning and generalize well to out-of-distribution settings?

2) Reliability: How do we better understand AI systems and further enhance their controllability and robustness? How can we design scalable methods to ensure their safety while improving their capabilities?

For more information, please see my Google Scholar, Semantic Scholar, and DBLP profiles.

Publications (Selected)

Quantifying Generalization Complexity for Large Language Models

Preprint.

We introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Using Scylla, we uncover a non-monotonic relationship between task complexity and the performance gap between in-distribution (ID) and out-of-distribution (OOD) data, which we term the "generalization valley". This phenomenon reveals a "critical complexity" at which reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities.

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Preprint.

We introduce rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or reliance on superior models.

FOLIO: Natural Language Reasoning with First-Order Logic

EMNLP 2024

We present FOLIO, a human-annotated, open-domain, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations.

Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge

NeurIPS 2024

We introduce Constrained Human-AI Cooperation (CHAIC), an inclusive embodied social intelligence challenge that tests social perception and cooperation in embodied agents.

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models (DPFM)

We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection.

PILLOW: Enhancing Efficient Instruction Fine-tuning via Prompt Matching

EMNLP 2023, Industry Track Oral Presentation

We improve LoRA-fine-tuned LLMs with a prompt-matching framework, reaching performance on par with full SFT.

QTSumm: A New Benchmark for Query-Focused Table Summarization

EMNLP 2023

We introduce QTSumm, a new benchmark for query-focused table summarization that contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics.

SaFER: A Robust and Efficient Framework for Finetuning BERT-based Classifier with Noisy Labels

ACL 2023, Industry Track

We propose a robust and efficient fine-tuning framework for BERT-based text classifiers that combats label noise without access to any clean data for training or validation.

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control

EACL 2023, Short Paper Oral Presentation

LoFT utilizes logic forms as fact verifiers and content planners to control logical table-to-text generation.


Industry

Research Assistant in LLMs, MIT-IBM Watson AI Lab (Cambridge, MA, U.S.)

Aug 2024 - Present

Researching LLM pre-training and post-training.

Research Assistant in LLMs, Microsoft Research Asia (Beijing, China)

Oct 2023 - Jun 2024

Researched LLM reasoning.

Research Intern in AI Algorithms, INF Technology (Shanghai, China)

Feb 2022 - Jun 2023

Researched LLMs for industrial applications.


Miscellaneous

I enjoy playing basketball, reading, cooking, and watching movies, and I am a huge fan of guitar and music!