Max Zimmer

Fourth-year PhD candidate in Mathematics at TU Berlin

Research Area Lead of iol.LEARN at the IOL Lab of Zuse Institute Berlin
Advisor: Prof. Dr. Sebastian Pokutta

Research: My research focuses on Deep Learning, Efficient ML, and Optimization. Specifically, I work on improving the efficiency of large Neural Networks through sparsity, pruning, quantization, and low-rank optimization. I am also interested in Federated Learning, Explainability & Fairness, and applying ML to problems in pure mathematics. Please take a look at my list of publications and feel free to reach out with questions or for potential collaborations!

Previously: Before joining IOL as a student researcher in 2020, I worked on Nash Flows over Time with Leon Sering at the COGA Group at TU Berlin. During my BSc and MSc in Mathematics at TU Berlin, I got the chance to intern in the research groups of Prof. Sergio de Rosa at Università degli Studi di Napoli Federico II and Prof. Marco Mondelli at IST Austria. Since 2022, I have been a member of the BMS graduate school, part of the MATH+ Cluster of Excellence. You can find my full CV here.

latest news

10/2024	Happy to share that I am now the Research Area Lead of the Machine Learning subgroup of IOL, iol.LEARN, alongside Sai Ganesh Nagarajan!
07/2024	Join our team in Berlin: we are seeking highly motivated PhD students to work on (efficient) Deep Learning, preferably with a strong math/CS background and PyTorch experience. Happy to answer questions via email or at ICML 2024 in Vienna! Apply directly here!
06/2024	Our paper Estimating Canopy Height at Scale has been accepted to ICML 2024 and is now available on arXiv! Check out the map on the Earth Engine!

selected publications

Estimating Canopy Height at Scale

Jan Pauls, Max Zimmer, Una M. Kelly, Martin Schwartz, Sassan Saatchi, Philippe CIAIS, Sebastian Pokutta, Martin Brandt, and Fabian Gieseke

In Forty-first International Conference on Machine Learning, 2024

We propose a framework for global-scale canopy height estimation based on satellite data. Our model leverages advanced data preprocessing techniques, resorts to a novel loss function designed to counter geolocation inaccuracies inherent in the ground-truth height measurements, and employs data from the Shuttle Radar Topography Mission to effectively filter out erroneous labels in mountainous regions, enhancing the reliability of our predictions in those areas. A comparison between predictions and ground-truth labels yields an MAE / RMSE of 2.43 / 4.73 (meters) overall and 4.45 / 6.72 (meters) for trees taller than five meters, which depicts a substantial improvement compared to existing global-scale maps. The resulting height map as well as the underlying framework will facilitate and enhance ecological analyses at a global scale, including, but not limited to, large-scale forest and biomass monitoring.
@inproceedings{pauls2024estimating,
  title = {Estimating Canopy Height at Scale},
  author = {Pauls, Jan and Zimmer, Max and Kelly, Una M. and Schwartz, Martin and Saatchi, Sassan and CIAIS, Philippe and Pokutta, Sebastian and Brandt, Martin and Gieseke, Fabian},
  booktitle = {Forty-first International Conference on Machine Learning},
  year = {2024},
  url = {https://openreview.net/forum?id=ZzCY0fRver},
}
Extending the Continuum of Six-Colorings

Konrad Mundinger, Sebastian Pokutta, Christoph Spiegel, and Max Zimmer

arXiv preprint arXiv:2404.05509, 2024

We present two novel six-colorings of the Euclidean plane that avoid monochromatic pairs of points at unit distance in five colors and monochromatic pairs at another specified distance $d$ in the sixth color. Such colorings have previously been known to exist for $0.41 < \sqrt{2} - 1 \le d \le 1/\sqrt{5} < 0.45$. Our results significantly expand that range to $0.354 \le d \le 0.657$, the first improvement in 30 years. Notably, the constructions underlying this were derived by formalizing colorings suggested by a custom machine learning approach.
@article{mundinger2024extending,
  title = {Extending the Continuum of Six-Colorings},
  author = {Mundinger, Konrad and Pokutta, Sebastian and Spiegel, Christoph and Zimmer, Max},
  year = {2024},
  primaryclass = {math.CO},
  journal = {arXiv preprint arXiv:2404.05509},
}
Neural Parameter Regression for Explicit Representations of PDE Solution Operators

Konrad Mundinger, Max Zimmer, and Sebastian Pokutta

ICLR 2024 Workshop on AI4DifferentialEquations In Science, 2024

We introduce Neural Parameter Regression (NPR), a novel framework specifically developed for learning solution operators in Partial Differential Equations (PDEs). Tailored for operator learning, this approach surpasses traditional DeepONets (Lu et al., 2021) by employing Physics-Informed Neural Network (PINN, Raissi et al., 2019) techniques to regress Neural Network (NN) parameters. By parametrizing each solution based on specific initial conditions, it effectively approximates a mapping between function spaces. Our method enhances parameter efficiency by incorporating low-rank matrices, thereby boosting computational efficiency and scalability. The framework shows remarkable adaptability to new initial and boundary conditions, allowing for rapid fine-tuning and inference, even in cases of out-of-distribution examples.
@article{mundinger2024neural,
  author = {Mundinger, Konrad and Zimmer, Max and Pokutta, Sebastian},
  title = {Neural Parameter Regression for Explicit Representations of PDE Solution Operators},
  year = {2024},
  journal = {ICLR 2024 Workshop on AI4DifferentialEquations In Science},
}
On the Byzantine-Resilience of Distillation-Based Federated Learning

Christophe Roux, Max Zimmer, and Sebastian Pokutta

arXiv preprint arXiv:2402.12265, 2024

Federated Learning (FL) algorithms using Knowledge Distillation (KD) have received increasing attention due to their favorable properties with respect to privacy, non-i.i.d. data and communication cost. These methods depart from transmitting model parameters and, instead, communicate information about a learning task by sharing predictions on a public dataset. In this work, we study the performance of such approaches in the byzantine setting, where a subset of the clients act in an adversarial manner aiming to disrupt the learning process. We show that KD-based FL algorithms are remarkably resilient and analyze how byzantine clients can influence the learning process compared to Federated Averaging. Based on these insights, we introduce two new byzantine attacks and demonstrate that they are effective against prior byzantine-resilient methods. Additionally, we propose FilterExp, a novel method designed to enhance the byzantine resilience of KD-based FL algorithms and demonstrate its efficacy. Finally, we provide a general method to make attacks harder to detect, improving their effectiveness.
@article{roux2024byzantine,
  author = {Roux, Christophe and Zimmer, Max and Pokutta, Sebastian},
  title = {On the Byzantine-Resilience of Distillation-Based Federated Learning},
  year = {2024},
  journal = {arXiv preprint arXiv:2402.12265},
}
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs

Max Zimmer, Megi Andoni, Christoph Spiegel, and Sebastian Pokutta

arXiv preprint arXiv:2312.15230, 2023

Neural Networks can be efficiently compressed through pruning, significantly reducing storage and computational demands while maintaining predictive performance. Simple yet effective methods like Iterative Magnitude Pruning (IMP, Han et al., 2015) remove less important parameters and require a costly retraining procedure to recover performance after pruning. However, with the rise of Large Language Models (LLMs), full retraining has become infeasible due to memory and compute constraints. In this study, we challenge the practice of retraining all parameters by demonstrating that updating only a small subset of highly expressive parameters is often sufficient to recover or even improve performance compared to full retraining. Surprisingly, retraining as little as 0.27%-0.35% of the parameters of GPT-architectures (OPT-2.7B/6.7B/13B/30B) achieves comparable performance to One Shot IMP across various sparsity levels. Our method, Parameter-Efficient Retraining after Pruning (PERP), drastically reduces compute and memory demands, enabling pruning and retraining of up to 30 billion parameter models on a single NVIDIA A100 GPU within minutes. Despite magnitude pruning being considered as unsuited for pruning LLMs, our findings show that PERP positions it as a strong contender against state-of-the-art retraining-free approaches such as Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023), opening up a promising alternative to avoiding retraining.
@article{zimmer2023perp,
  author = {Zimmer, Max and Andoni, Megi and Spiegel, Christoph and Pokutta, Sebastian},
  title = {PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs},
  year = {2023},
  journal = {arXiv preprint arXiv:2312.15230},
}
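
As a rough illustration of the one-shot magnitude pruning mentioned in the abstract above, the following Python sketch zeroes out the smallest-magnitude weights and then applies a gradient step that keeps the pruned entries at zero. It is a toy example on a flat list of weights and not the PERP method itself; the function names `magnitude_prune` and `masked_sgd_step`, the example weights, the gradients, and the learning rate are all assumptions made for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Returns the pruned weights and a 0/1 mask (1 = kept, 0 = pruned).
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One gradient step that leaves pruned entries at exactly zero."""
    return [(w - lr * g) if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# Hypothetical toy weights; real models hold millions or billions of parameters.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned, mask = magnitude_prune(weights, sparsity=0.5)

# Hypothetical gradients, used only to show that the mask is respected.
updated = masked_sgd_step(pruned, grads=[1.0] * 6, mask=mask, lr=0.1)
```

In this toy setting half of the weights are removed and remain zero after the update. Which small subset of parameters PERP actually retrains is described in the paper and not reproduced here.
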
Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta

In International Conference on Learning Representations, 2024

Neural networks can be significantly compressed by pruning, leading to sparse models requiring considerably less storage and floating-point operations while maintaining predictive performance. Model soups (Wortsman et al., 2022) improve generalization and out-of-distribution performance by averaging the parameters of multiple models into a single one without increased inference time. However, identifying models in the same loss basin to leverage both sparsity and parameter averaging is challenging, as averaging arbitrary sparse models reduces the overall sparsity due to differing sparse connectivities. In this work, we address these challenges by demonstrating that exploring a single retraining phase of Iterative Magnitude Pruning (IMP) with varying hyperparameter configurations, such as batch ordering or weight decay, produces models that are suitable for averaging and share the same sparse connectivity by design. Averaging these models significantly enhances generalization performance compared to their individual components. Building on this idea, we introduce Sparse Model Soups (SMS), a novel method for merging sparse models by initiating each prune-retrain cycle with the averaged model of the previous phase. SMS maintains sparsity, exploits sparse network benefits being modular and fully parallelizable, and substantially improves IMP’s performance. Additionally, we demonstrate that SMS can be adapted to enhance the performance of state-of-the-art pruning during training approaches.
@inproceedings{zimmer2023sparse,
  title = {Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging},
  author = {Zimmer, Max and Spiegel, Christoph and Pokutta, Sebastian},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
}
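
The observation in the abstract above, that averaging arbitrary sparse models reduces sparsity while averaging models with the same sparse connectivity preserves it, can be checked on a toy example. The sketch below uses plain Python lists as stand-ins for weight vectors; the helper names `average` and `sparsity` and all numbers are assumptions for illustration, and this is not the SMS algorithm itself.

```python
def average(models):
    """Element-wise mean of equally sized weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Two hypothetical models that share the same pruning mask (positions 1 and 2 pruned).
same_a = [0.5, 0.0, 0.0, -1.0]
same_b = [0.7, 0.0, 0.0, -0.6]

# Two hypothetical models with different masks, each 50% sparse on its own.
diff_a = [0.5, 0.0, 0.0, -1.0]
diff_b = [0.0, 0.3, 0.0, -0.6]

soup_same = average([same_a, same_b])  # zeros stay zero: still 50% sparse
soup_diff = average([diff_a, diff_b])  # masks differ: only 25% sparse
```

With a shared mask the averaged model keeps the sparsity of its ingredients, whereas with differing masks only the entries pruned in every model remain zero.
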
Interpretability Guarantees with Merlin-Arthur Classifiers

Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, and Sebastian Pokutta

In International Conference on Artificial Intelligence and Statistics, 2024

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
@inproceedings{waldchen2024interpretability,
  title = {Interpretability Guarantees with Merlin-Arthur Classifiers},
  author = {W{\"a}ldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
  booktitle = {International Conference on Artificial Intelligence and Statistics},
  year = {2024},
}
How I Learned To Stop Worrying And Love Retraining

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta

In International Conference on Learning Representations, 2023

Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. Recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Improving on existing retraining approaches, we additionally propose a method to adaptively select the initial value of the linear schedule. Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of ‘hard’ pruning by incorporating the sparsification process into the standard training.
@inproceedings{Zimmer2023,
  author = {Zimmer, Max and Spiegel, Christoph and Pokutta, Sebastian},
  booktitle = {International Conference on Learning Representations},
  title = {{H}ow {I} {L}earned {T}o {S}top {W}orrying {A}nd {L}ove {R}etraining},
  year = {2023},
  url = {https://openreview.net/forum?id=_nF5imFKQI},
}
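
The abstract above refers to a simple linear learning rate schedule for the retraining phase. A minimal sketch of such a schedule is given below, assuming a decay from an initial value to zero over a fixed step budget and no warmup; the function name `linear_schedule`, its parameters, and the example values are illustrative assumptions, and the adaptive choice of the initial value proposed in the paper is not reproduced.

```python
def linear_schedule(step, total_steps, initial_lr):
    """Learning rate that decays linearly from initial_lr (step 0) to 0 (step total_steps)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return initial_lr * (1 - step / total_steps)


# Hypothetical retraining budget of 10 steps with an initial learning rate of 0.1.
schedule = [linear_schedule(s, total_steps=10, initial_lr=0.1) for s in range(11)]
```

The schedule starts at the initial value, reaches half of it at the midpoint of the budget, and ends at zero, so shortening the budget simply compresses the same decay into fewer steps.
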
Compression-aware Training of Neural Networks using Frank-Wolfe

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta

arXiv preprint arXiv:2205.11921, 2022

Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-caused performance degradation or they induce strong biases to converge to a specific sparse solution throughout training. A third paradigm obtains a wide range of compression ratios from a single dense training run while also avoiding retraining. Recent work of Pokutta et al. (2020) and Miao et al. (2022) suggests that the Stochastic Frank-Wolfe (SFW) algorithm is particularly suited for training state-of-the-art models that are robust to compression. We propose leveraging k-support norm ball constraints and demonstrate significant improvements over the results of Miao et al. (2022) in the case of unstructured pruning. We also extend these ideas to the structured pruning domain and propose novel approaches to both ensure robustness to the pruning of convolutional filters as well as to low-rank tensor decompositions of convolutional layers. In the latter case, our approach performs on-par with nuclear-norm regularization baselines while requiring only half of the computational resources. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate and we establish a theoretical foundation for that practice.
@article{zimmer2022compression,
  title = {Compression-aware Training of Neural Networks using Frank-Wolfe},
  author = {Zimmer, Max and Spiegel, Christoph and Pokutta, Sebastian},
  journal = {arXiv preprint arXiv:2205.11921},
  year = {2022},
}
Deep Neural Network training with Frank-Wolfe

Sebastian Pokutta, Christoph Spiegel, and Max Zimmer

arXiv preprint arXiv:2010.07243, 2020

This paper studies the empirical efficacy and benefits of using projection-free first-order methods in the form of Conditional Gradients, a.k.a. Frank-Wolfe methods, for training Neural Networks with constrained parameters. We draw comparisons both to current state-of-the-art stochastic Gradient Descent methods as well as across different variants of stochastic Conditional Gradients. In particular, we show the general feasibility of training Neural Networks whose parameters are constrained by a convex feasible region using Frank-Wolfe algorithms and compare different stochastic variants. We then show that, by choosing an appropriate region, one can achieve performance exceeding that of unconstrained stochastic Gradient Descent and matching state-of-the-art results relying on L^2-regularization. Lastly, we also demonstrate that, besides impacting performance, the particular choice of constraints can have a drastic impact on the learned representations.
@article{pokutta2020deep,
  title = {Deep Neural Network training with Frank-Wolfe},
  author = {Pokutta, Sebastian and Spiegel, Christoph and Zimmer, Max},
  journal = {arXiv preprint arXiv:2010.07243},
  year = {2020},
}
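
To make the idea of projection-free training over a convex feasible region concrete, here is a minimal sketch of the classical deterministic Frank-Wolfe iteration on a toy quadratic objective with an L-infinity ball constraint. It is not the stochastic variant studied in the paper; the objective, the target vector, the radius, the step size rule 2/(k+2), and the names `lmo_linf` and `frank_wolfe` are assumptions chosen for illustration.

```python
def lmo_linf(grad, radius):
    """Linear minimization oracle for the L-infinity ball.

    Returns a vertex v with |v_i| = radius that minimizes <grad, v>.
    """
    return [-radius if g > 0 else radius for g in grad]


def frank_wolfe(grad_fn, x0, radius, steps):
    """Frank-Wolfe with the standard step size 2 / (k + 2)."""
    x = list(x0)
    for k in range(steps):
        g = grad_fn(x)
        v = lmo_linf(g, radius)
        gamma = 2.0 / (k + 2)
        # Convex combination of x and v: the iterate never leaves the ball,
        # so no projection step is needed.
        x = [xi + gamma * (vi - xi) for xi, vi in zip(x, v)]
    return x


# Toy objective f(x) = 0.5 * ||x - target||^2 with gradient x - target.
target = [2.0, -0.3]


def grad_fn(x):
    return [xi - ti for xi, ti in zip(x, target)]


x = frank_wolfe(grad_fn, x0=[0.0, 0.0], radius=1.0, steps=200)
```

In this example the first coordinate of the target lies outside the ball, so the iterate ends up on the boundary at 1.0, while the second coordinate approaches -0.3 from inside the feasible region.
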