Erwan Fagnou

I am currently a PhD student at LAMSADE (Université Paris Dauphine-PSL, Paris, France), under the supervision of Alexandre Allauzen. I am part of the MILES team, which is dedicated to machine learning.

My research is about frugal machine learning: I aim to make deep learning models more resource-efficient while maintaining their performance, where "resources" include time, memory, money, carbon emissions, and so on. The PhD focuses mainly on transformers applied to NLP, as they are by far the most resource-consuming models in the field (hello ChatGPT). This is still a very broad topic, and multiple angles of attack are considered, from the architecture of the models to the optimization algorithms used to train them.

erwan.fagnou/AT/dauphine.psl.eu
LAMSADE, Université Paris Dauphine-PSL
Paris, France

News
Publications
Career & Education

News

🔥 Apr. 2026
Our paper "Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing" has been accepted to ICML 2026 (Seoul, South Korea)!
Jan. 2026
Our paper "Scaling Direct Feedback Learning with Jacobian Alignment Guarantees" has been accepted to ICLR 2026 (Rio de Janeiro, Brazil)!
Oct. 2025
We will present our paper "Forward Only Learning for Orthogonal Neural Networks of any Depth" at ECAI 2025 in Bologna, Italy.
Sept. 2025
We will present our ICLR oral paper at the ECML-PKDD 2025 conference in Porto, Portugal.
Apr. 2025
Meet us at AISTATS 2025 in Phuket, Thailand, where we will present our paper "Bridging the Theoretical Gap in Randomized Smoothing"!
Jan. 2025
Our paper "Accelerated Training through Iterative Gradient Propagation Along the Residual Path" has been accepted as an oral to ICLR 2025 in Singapore!
Sept. 2024
Our paper "Chain and Causal Attention for Efficient Entity Tracking" has been accepted to EMNLP 2024 (main) in Miami! 🌴
Dec. 2023
Started my PhD at LAMSADE (Université Paris Dauphine-PSL) under the supervision of Alexandre Allauzen.

Publications

First author

July 2026
Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing
Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
ICML 2026
Token mixing layers play a key role in how language models learn and generate long-range dependencies. Their efficiency is governed by a necessary trade-off between decoding speed, memory requirements, and cache size. Considering causal generation, this paper explores new trade-offs through a unified framework that separates two crucial features: (i) the direct influence of inputs on outputs within one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, but also generalizes their recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity while providing theoretical insights on their expressivity, trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks, along with language modeling. Together, these results provide a unified toolkit for understanding and designing efficient and expressive token mixers across model families.
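For intuition, one way to write the generalization sketched in this abstract (the notation $\mathcal{J}_t$ is mine; the paper's structured choices of these index sets are what trade complexity for expressivity):

```latex
% Standard linear recurrence (SSM-style): one predecessor per state.
h_t = A \, h_{t-1} + B \, x_t
% Generalized recurrence: each state may depend on a structured set of
% past states \mathcal{J}_t \subseteq \{0, \dots, t-1\} (notation mine).
h_t = \sum_{j \in \mathcal{J}_t} A_{t,j} \, h_j + B \, x_t
```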
Apr. 2025
Accelerated Training through Iterative Gradient Propagation Along the Residual Path
Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
ICLR 2025 (Oral) [pdf]
Despite being the cornerstone of deep learning, backpropagation is criticized for its inherent sequentiality, which can limit the scalability of very deep models. Such models faced convergence issues due to vanishing gradients, later resolved using residual connections, variants of which are now widely used in modern architectures. However, the computational cost of backpropagation remains a major burden, accounting for most of the training time. Taking advantage of residual-like architectural designs, we introduce Highway backpropagation, a parallelizable iterative algorithm that approximates backpropagation by alternating between i) accumulating the gradient estimates along the residual path, and ii) backpropagating them through every layer in parallel. This algorithm is naturally derived from a decomposition of the gradient as the sum of gradients flowing through all paths, and is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks. Through an extensive empirical study on a large selection of tasks and models, we evaluate Highway-BP and show that major speedups can be achieved with minimal performance degradation.
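To make the alternating scheme concrete, here is a minimal numpy sketch on a toy residual network with linear blocks, so the approximation can be checked against exact backpropagation; the variable names and toy setup are my illustration, not the paper's implementation:

```python
import numpy as np

# Toy residual network x_{l+1} = x_l + W_l x_l with linear blocks, so the
# exact input gradient is prod_l (I + W_l)^T @ e and the iterative
# approximation can be checked against it.
rng = np.random.default_rng(0)
L, d = 6, 8
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]
e = rng.standard_normal(d)                  # gradient at the output

# Exact backpropagation (sequential over layers).
exact = e.copy()
for W in reversed(Ws):
    exact = exact + W.T @ exact             # apply (I + W_l)^T

def highway_bp(K):
    """K iterations of the alternating scheme described in the abstract."""
    deltas = [e.copy() for _ in range(L + 1)]    # estimate at each layer input
    for _ in range(K):
        # ii) backpropagate through every block in parallel (independent ops)
        h = [Ws[m].T @ deltas[m + 1] for m in range(L)]
        # i) accumulate the estimates along the residual (identity) path;
        # this suffix sum is a scan, hence also parallelizable.
        new, suffix = [None] * (L + 1), np.zeros(d)
        new[L] = e.copy()
        for l in range(L - 1, -1, -1):
            suffix = suffix + h[l]
            new[l] = e + suffix
        deltas = new
    return deltas[0]

for K in (1, 2, 4, L):
    err = np.linalg.norm(highway_bp(K) - exact) / np.linalg.norm(exact)
    print(f"K={K}: relative error {err:.2e}")    # exact once K = L
```

Each iteration captures the gradient paths traversing one more non-identity block, so the estimate converges to the exact gradient after L iterations while every iteration stays layer-parallel.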
Nov. 2024
Chain and Causal Attention for Efficient Entity Tracking
Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
EMNLP 2024 [pdf]
This paper investigates the limitations of transformers for entity-tracking tasks in large language models. We identify a theoretical constraint, showing that transformers require at least $\log_2(n+1)$ layers to handle entity tracking with $n$ state changes. To address this issue, we propose an efficient and frugal enhancement to the standard attention mechanism, enabling it to manage long-term dependencies more efficiently. By considering attention as an adjacency matrix, our model can track entity states with a single layer. Empirical results demonstrate significant improvements on entity-tracking datasets while keeping competitive performance on standard natural language modeling. Our modified attention allows us to achieve the same performance with drastically fewer layers. Additionally, our enhanced mechanism reveals structured internal representations of attention. Extensive experiments on both toy and complex datasets validate our approach. Our contributions include theoretical insights, an improved attention mechanism, and empirical validation.
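A minimal numpy sketch of the adjacency-matrix view (my illustration of the idea, not necessarily the paper's exact parameterization): a strictly lower-triangular attention matrix A is nilpotent, so the multi-hop series A + A^2 + ... is finite and equals (I - A)^{-1} - I, letting a single layer follow an entire chain of state changes.

```python
import numpy as np

# Causal attention weights viewed as an adjacency matrix over positions:
# A[t, s] > 0 means position t attends to position s < t. Here each
# position attends only to its predecessor, forming a chain.
n = 6
A = np.zeros((n, n))
for t in range(1, n):
    A[t, t - 1] = 1.0

one_hop = A                                           # vanilla attention reach
multi_hop = np.linalg.inv(np.eye(n) - A) - np.eye(n)  # A + A^2 + ... + A^(n-1)

print(one_hop[5, 1])    # 0.0: one vanilla layer sees no direct edge 5 -> 1
print(multi_hop[5, 1])  # 1.0: one "chained" layer follows 5 -> 4 -> ... -> 1
```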

Co-author

Apr. 2026
Scaling Direct Feedback Learning with Jacobian Alignment Guarantees
Paul Caillon, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen
ICLR 2026 [pdf]
Deep neural networks rely on backpropagation (BP) for optimization, but its strictly sequential backward pass hinders parallelism and scalability. Direct Feedback Alignment (DFA) has been proposed as a promising approach for parallel learning of deep neural networks, relying on fixed random projections to enable layer-wise parallel updates, but it fails on deep convolutional networks and performs poorly on modern transformer architectures. We introduce GrAPE (Gradient-Aligned Projected Error), a hybrid feedback-alignment method that (i) estimates rank-1 Jacobians via forward-mode JVPs and (ii) aligns each layer's feedback matrix by minimizing a local cosine-alignment loss. To curb drift in very deep models, GrAPE performs infrequent BP anchor steps on a single mini-batch, preserving mostly parallel updates. We show that the forward-gradient estimator has a strictly positive expected cosine similarity with the true Jacobian. We relate this estimator-level guarantee to a standard stochastic-approximation result under a positive expected-cosine condition on the update direction, providing theoretical support for GrAPE's alignment objective. Empirically, GrAPE consistently outperforms prior alternatives to BP, enabling the training of modern architectures and closing a large fraction of the gap to BP while retaining layer-parallel updates for the vast majority of steps.
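The rank-1 forward-mode estimate mentioned in point (i) can be illustrated with the standard identity E[v v^T] = I (a generic sketch, not the paper's code):

```python
import numpy as np

# Forward-gradient estimator: a Jacobian-vector product J @ v with
# v ~ N(0, I) yields the rank-1 estimate (J @ v) v^T, which is unbiased
# for J since E[v v^T] = I. Moreover <J v v^T, J>_F = ||J v||^2 >= 0, so
# each single estimate already has a nonnegative cosine with J.
rng = np.random.default_rng(0)
m, n = 5, 7
J = rng.standard_normal((m, n))          # "true" Jacobian of some layer

def cosine(X, Y):
    return np.sum(X * Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

cosines = []
for _ in range(1000):
    v = rng.standard_normal(n)
    est = np.outer(J @ v, v)             # one rank-1 forward-mode estimate
    cosines.append(cosine(est, J))

print(f"mean cosine with the true Jacobian: {np.mean(cosines):.3f}")  # > 0
```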
Oct. 2025
Forward Only Learning for Orthogonal Neural Networks of any Depth
Paul Caillon, Alex Colagrande, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen
ECAI 2025 [pdf]
Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm becomes a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, limiting their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under linearity and orthogonality assumptions. By relaxing the linearity assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with the backpropagation algorithm. Experimental results show that FOTON outperforms PEPITA, enabling us to train neural networks of any depth without the need for a backward pass. Moreover, the performance of FOTON applied to convolutional networks clearly opens up avenues for its application to more advanced architectures.
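A tiny numpy check of why the orthogonality assumption matters here (my illustration, not FOTON itself): for an orthogonal weight matrix, the transpose used by backpropagation coincides with the inverse, so the backward operator of a linear orthogonal layer is just the layer run in reverse.

```python
import numpy as np

# For orthogonal W, backprop's W^T equals W^{-1}: the backward operator
# of a linear orthogonal layer needs no separate backward pass.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random orthogonal matrix
e = rng.standard_normal(8)                         # an error signal

print(np.allclose(W.T @ e, np.linalg.solve(W, e)))  # True: W^T e = W^{-1} e
```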
May 2025
Bridging the Theoretical Gap in Randomized Smoothing
Blaise Delattre, Paul Caillon, Quentin Barthélemy, Erwan Fagnou, Alexandre Allauzen
AISTATS 2025 [pdf]
Randomized smoothing has become a leading approach for certifying the adversarial robustness of machine learning models. However, a persistent gap remains between theoretical certified robustness and empirically observed robust accuracy. This paper introduces a new framework that bridges this gap by leveraging Lipschitz continuity for certification and by proposing a novel, less conservative method for computing confidence intervals in randomized smoothing. Our approach tightens the bounds of certified robustness, offering a more accurate reflection of model robustness in practice. Through rigorous experimentation, we show that our method improves robust accuracy, narrowing the gap between empirical findings and previous theoretical results. We argue that investigating local Lipschitz constants and designing ad hoc confidence intervals can further enhance the performance of randomized smoothing. These results pave the way for a deeper understanding of the relationship between Lipschitz continuity and certified robustness.
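For context, here is the standard certificate this line of work tightens (Cohen et al., 2019), with the conservative Clopper-Pearson confidence interval; a sketch of the baseline, not the paper's improved intervals:

```python
from scipy.stats import beta, norm

def certified_radius(k, n, sigma, alpha=0.001):
    """Cohen et al. (2019) radius: k of n noisy samples voted top class."""
    p_lower = beta.ppf(alpha, k, n - k + 1)  # one-sided Clopper-Pearson bound
    if p_lower <= 0.5:
        return None                          # abstain: no certificate
    return sigma * norm.ppf(p_lower)         # sigma * Phi^{-1}(p_A lower)

print(certified_radius(k=990, n=1000, sigma=0.5))
```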

PhD

Dec. 2023 - Today

PhD student in the MILES team of the LAMSADE lab, at Université Paris Dauphine-PSL, under the supervision of Alexandre Allauzen. The PhD is funded by the PEPR SHARP project (Sharp Theoretical and Algorithmic Principles for frugal Machine Learning), which brings together several partners collaborating on frugal machine learning.

Education

2022 - 2023

ENS Paris-Saclay
Master 2: Mathematics, Vision and Learning (MVA)
Selective Master's degree in machine learning, preparing students for research. See the official website (in French), or here (in English).

2020 - 2023

Télécom Paris
Master of Science
- GPA: 4.0
- 3rd year: MVA Master at ENS Paris-Saclay
- 2nd year: SD (Data Science) and MITRO (Mathematics, Theoretical Computer Science and Operational Research) tracks