EMNLP 2025 Tutorial
NLP+Code: Code Intelligence in Language Models

1Monash University, 2CSIRO's Data61, 3ByteDance
4Meta, 5NVIDIA, 6Alibaba Group, 7Hugging Face
code-lm@googlegroups.com

Saturday, Nov 8, 9:00 - 12:30 @ Suzhou International Expo Centre

About this tutorial

Language models (LMs) like GPT and Claude have shown impressive abilities across a range of natural language processing (NLP) tasks. Among these tasks, code understanding and generation have quickly become one of the most popular applications of LMs, since code is by nature an executable logic form. However, we still lack a practical understanding of how programming knowledge can be combined with natural language to automate software development. Moreover, recent studies empirically demonstrate that code can be a better medium for complex reasoning and agentic task automation, yet they rarely explain why this is the case.

In this tutorial, we refer to these superior capabilities brought by code modeling as Code Intelligence, and aim to provide a coherent overview of recent advances on this topic. We will start by providing preliminaries on training foundation models on code and the common practices involved. We will then focus on downstream tasks in the code domain and their evaluation. Finally, we will cover how code can contribute to advances in general tasks, and discuss opportunities for future research on Code Intelligence.

Schedule

Our tutorial will be held on Saturday, Nov 8 (all times are given in Beijing Time, UTC+8). The schedule may be subject to updates.

Time Section Presenter Slides
9:00—9:15 Section 1: Introduction Loubna/Terry [Slides]
9:15—9:30 Section 2: Preliminaries Terry [Slides]
9:30—9:50 Section 3: Post-training Code LMs: Supervised Fine-Tuning Wasi [Slides]
9:50—10:15 Section 4: Post-training Code LMs: Reinforcement Learning Binyuan [Slides]
10:15—10:25 Q & A Session I
10:25—10:55 Coffee Break
10:55—11:15 Section 5: Evaluating Code LMs: Function-level Code Generation Terry [Slides]
11:15—11:35 Section 6: Evaluating Code LMs: Repo-level & Agentic Code Generation Zijian [Slides]
11:35—11:55 Section 7: Bridging between Code and Natural Language Qian [Slides]
11:55—12:10 Section 8: Special Topics Terry/Loubna [Slides]
12:10—12:20 Section 9: Conclusion Terry [Slides]
12:20—12:30 Q & A Session II

Reading List


Post-training Code LMs: Supervised Fine-Tuning

  • DeepSeek-V3 Technical Report (Paper)
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Paper)
  • Qwen3 Technical Report (Paper)
  • The Llama 3 Herd of Models (Paper)
  • Magicoder: Empowering Code Generation with OSS-Instruct (Paper)
  • SelfCodeAlign: Self-Alignment for Code Generation (Paper)
  • OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (Paper)
  • OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models (Paper)
  • OpenCodeReasoning: Advancing Data Distillation for Competitive Coding (Paper)
  • AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy (Paper)
  • Llama 4: Multimodal Intelligence (Blog Post)
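
Several entries above (e.g., Magicoder's OSS-Instruct and SelfCodeAlign) construct SFT data by seeding an LLM with real open-source code and asking it to synthesize instruction-response pairs. Below is a minimal sketch of that loop; `complete` is a hypothetical stand-in for whatever LLM client you use, not an API from these papers.

import random

# Hypothetical stand-in for any instruction-following LLM client;
# the papers above each use their own models and prompts.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

SYNTHESIS_PROMPT = """\
Below is a snippet from an open-source project. Use it as inspiration
to write (1) a self-contained programming problem and (2) a correct
solution to that problem.

Snippet:
{seed}
"""

def synthesize_sft_pairs(seed_snippets: list[str], n_pairs: int) -> list[dict]:
    """OSS-Instruct-style loop: seed with real code, ask an LLM for a
    (problem, solution) pair, and collect the results as SFT examples."""
    pairs = []
    for seed in random.sample(seed_snippets, n_pairs):
        generated = complete(SYNTHESIS_PROMPT.format(seed=seed))
        pairs.append({"seed": seed, "generated": generated})
    return pairs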

Post-training Code LMs: Reinforcement Learning

  • Execution-based Code Generation using Deep Reinforcement Learning (Paper)
  • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (Paper)
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Paper)
  • DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level (Blog Post)
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution (Paper)
  • Evolving SkyRL into a Highly-Modular RL Framework (Blog Post)
  • Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval (Blog Post)
  • Improving Cursor Tab with Online RL (Blog Post)
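
A recurring ingredient in the works above is an execution-based reward: a completion earns credit only if the program it produces actually runs and passes its tests. A minimal sketch of such a reward function, with sandboxing and resource limits simplified away for illustration:

import subprocess
import sys
import tempfile

def execution_reward(candidate_code: str, test_code: str,
                     timeout: float = 10.0) -> float:
    """Binary execution-based reward: 1.0 if the candidate passes the
    bundled tests, 0.0 otherwise. Production RL pipelines add sandboxing,
    resource limits, and often shaped partial credit per test case."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0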

Evaluating Code LMs: Function-level Code Generation

  • Evaluating Large Language Models Trained on Code (Paper)
  • Program Synthesis with Large Language Models (Paper)
  • Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (Paper)
  • BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions (Paper)
  • BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution (Paper)
  • MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (Paper)
  • LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Paper)
  • Aider Polyglot (Link)
  • Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions (Paper)
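
The first entry above (the Codex paper) introduced pass@k, the metric most of these benchmarks report. Its unbiased estimator is small enough to reproduce here:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from 'Evaluating Large Language Models
    Trained on Code': of n sampled completions, c passed the tests.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable form."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=20, k=10))  # ~0.66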

Evaluating Code LMs: Repo-level & Agentic Code Generation

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Paper)
  • SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents (Paper)
  • Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (Paper)
  • SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? (Paper)
  • SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? (Paper)
  • SWE-Lancer: Evaluating the Economic Impact of AI on Software Engineering Freelance Markets (Paper)
  • SWE-Bench Pro: Assessing Language Models on Complex and Diverse Software Engineering Tasks (Paper)
  • SWE-Bench-Live: Real-Time Evaluation of Language Models on Live Software Engineering Issues (Paper)
  • CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (Paper)
  • Commit0: Library Generation from Scratch (Paper)
  • Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks (Paper)
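
SWE-bench and its descendants above share an evaluation recipe: check the repository out at a pinned commit, apply the model's patch, and rerun the issue's originally failing tests. A simplified sketch of that loop (the real harness additionally runs each task in its own container with pinned dependencies):

import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, model_patch: str,
                   test_cmd: list[str]) -> bool:
    """Simplified SWE-bench-style check: reset to the task's base commit,
    apply the model-generated patch, and see whether the previously
    failing tests now pass."""
    subprocess.run(["git", "checkout", "-f", base_commit],
                   cwd=repo_dir, check=True)
    applied = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                             input=model_patch.encode())
    if applied.returncode != 0:
        return False  # an unapplicable patch counts as unresolved
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0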

Bridging between Code and Natural Language

  • Show Your Work: Scratchpads for Intermediate Computation with Language Models (Paper)
  • PAL: Program-aided Language Models (Paper)
  • ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving (Paper)
  • Qwen2.5-Coder Technical Report (Paper)
  • Lemur: Harmonizing Natural Language and Code for Language Agents (Paper)
  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Paper)
  • Reasoning Like Program Executors (Paper)
  • CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction (Paper)
  • CWM: An Open-Weights LLM for Research on Code Generation with World Models (Paper)
  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks (Paper)
  • SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning (Paper)
  • SmolLM3: smol, multilingual, long-context reasoner (Blog Post)
  • Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models (Paper)
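
PAL and Program of Thoughts, both listed above, make the same move: the LM writes a program for the computation step of a reasoning problem, and the interpreter (not the model) produces the final answer. A minimal sketch of the execution half, with the LM's output shown as a fixed string:

def run_program_of_thought(program: str):
    """Execute an LM-written reasoning program and return the value it
    binds to `answer`, as in PAL/PoT-style pipelines. Illustrative only:
    real systems run this in a sandbox rather than calling exec()."""
    namespace: dict = {}
    exec(program, namespace)  # the interpreter does the arithmetic
    return namespace["answer"]

# The kind of program an LM might emit for: "Roger has 5 tennis balls
# and buys 2 cans with 3 balls each. How many balls does he have now?"
program = """
initial_balls = 5
bought_balls = 2 * 3
answer = initial_balls + bought_balls
"""
print(run_program_of_thought(program))  # -> 11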

BibTeX

@inproceedings{zhuo2025codelm,
  title={NLP+Code: Code Intelligence in Language Models},
  author={Zhuo, Terry Yue and Liu, Qian and Wang, Zijian and Ahmad, Wasi U and Hui, Binyuan and Allal, Loubna Ben},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts},
  pages={9--11},
  year={2025}
}