Gabrijel Boduljak

I am a first-year doctoral student in the Visual Geometry Group (VGG) at the University of Oxford, advised by Christian Rupprecht, Iro Laina, and Andrea Vedaldi. My primary research interests are generative world modeling and learning intuitive physics from video. Additionally, I am interested in the theory of flow matching and mathematical optimal transport.

Research

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Gabrijel Boduljak | Laurynas Karazija | Iro Laina | Christian Rupprecht | Andrea Vedaldi

In review.

While recent high-quality video generators are often presented as world models, we show that they struggle to accurately forecast motion even in simple physical scenarios, such as falling blocks or objects interacting mechanically, even after fine-tuning on such data. We hypothesize that this limitation arises from the overhead of generating pixels instead of motion. To address this, we develop a video-like diffusion model that generates (quasi-)dense motion trajectories rather than pixels, achieving significantly more efficient and accurate predictions. We also propose improved metrics to assess motion forecasting in ambiguous settings, and use them to show that our approach compares favorably against both video generators and prior motion forecasters.

Project Page →

On Vanishing Variance in Transformer Length Generalization

Gabrijel Boduljak | Ruining Li | Jensen (Jinghao) Zhou

1st Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models (SCOPE), ICLR 2025

It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.

Project Page →