V-JEPA 2

4.2.1 Successes: Permanence and Occlusion

V-JEPA demonstrates robust scaling on object permanence: the model effectively learns that objects do not cease to exist when occluded. Notably, for videos where physics-breaking events occur behind an occluder, V-JEPA’s performance is well correlated with human performance. This suggests that the masking-based pre-training objective, predicting masked regions from context, is a direct proxy for learning permanence.

4.2.2 Failures: The Solidity and Gravity Gap

The model struggles significantly with solidity, where objects pass through each other, and with gravity. The lack of significant improvement on solidity is a critical finding.

  • Insight: While JEPA captures the concept of an object, it may not capture the precise, hard boundaries required to detect subtle interpenetrations. Unlike a pixel model, which might register a texture clash, or a physics engine, which computes vertex intersections, a latent predictor may simply smooth over the collision and reinterpret it as a morphing or occlusion event. Research also points to framerate constraints: a brief interpenetration can fall between sampled frames.
  • Gravity: While later iterations show improvement, detecting gravity violations requires second-order information: acceleration. From visual observation alone, without a fixed reference frame or proprioceptive feedback, distinguishing “falling” from merely “moving down” is difficult.
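The demand for second-order information can be made concrete: recovering acceleration from position observations requires at least three frames, via a second finite difference. A minimal numpy sketch (the free-fall trajectory and 10 fps sampling rate are illustrative assumptions, not benchmark settings):

```python
import numpy as np

def estimate_acceleration(positions, dt):
    """Second central difference: a_t ~ (x_{t+1} - 2*x_t + x_{t-1}) / dt^2.
    Needs three consecutive frames; velocity needs only two."""
    return (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2

dt = 0.1                          # illustrative: 10 fps sampling
t = np.arange(0.0, 1.0, dt)
y = -0.5 * 9.81 * t**2            # vertical position of an object in free fall

accel = estimate_acceleration(y, dt)
print(accel)                      # constant ~ -9.81: the signature of "falling"
```

A model that only matches consecutive-frame statistics can distinguish moving from static (first difference) but not constant velocity from constant acceleration, which is exactly the distinction gravity violations hinge on.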

4.3 The IntPhys 2 “Chance Level” Crisis

Despite improvements, the introduction of the IntPhys 2 benchmark has exposed the limitations of current scaling laws. IntPhys 2 introduces more complex scenarios involving object immutability and causal chains.

  • Observation: On IntPhys 2, V-JEPA 2 and other state-of-the-art video models perform “at or close to chance,” whereas human baselines remain in the 85-95% accuracy range.
  • Implication: This failure suggests that passive video observation, even at the scale of 22 million videos, hits a ceiling for causal reasoning. The model learns correlations (what typically follows what) but lacks the counterfactual robustness (what cannot happen) that humans derive from embodied interaction.

1. Core Bottlenecks

From first principles:

A. Objective Design: Making JEPA + Slots + 3D + Dynamics Compatible

  • JEPA wants to learn an embedding where future latents are predictable from context, often encouraging invariance to nuisance factors.
  • Object slots need a representation that is:
    • factorized over objects
    • equivariant to changes in object state: pose, position, and occlusion
  • 3D geometry wants consistency under viewpoint (camera-pose) changes, which is likewise an inherently equivariant requirement.
  • Dynamics wants the latent to change in a structured way under time and actions.
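The tension between these desiderata can be illustrated with a toy slot latent. In the sketch below (all structure, dimensions, and the linear dynamics are illustrative assumptions, not V-JEPA internals), each slot holds position, velocity, and appearance features; a dynamics step and a camera translation act only on the relevant dimensions, and the two operations commute, which is what "equivariant to object state and viewpoint" buys you:

```python
import numpy as np

# Toy factorized latent: K object slots, each = [x, y, vx, vy, appearance...].
K, APP = 3, 4
rng = np.random.default_rng(0)
slots = rng.normal(size=(K, 4 + APP))

def dynamics(z, dt=0.1):
    """Advance each slot's position by its velocity; appearance unchanged."""
    z = z.copy()
    z[:, 0:2] += dt * z[:, 2:4]
    return z

def translate(z, offset):
    """A camera/world translation acts on the position dims only."""
    z = z.copy()
    z[:, 0:2] += offset
    return z

# Equivariance check: transform-then-predict equals predict-then-transform.
offset = np.array([1.0, -2.0])
lhs = dynamics(translate(slots, offset))
rhs = translate(dynamics(slots), offset)
print(np.allclose(lhs, rhs))  # True
```

A fully invariant latent would discard the position dims entirely, making the dynamics module's job impossible; an unfactorized latent would mix all slots into one vector, destroying the per-object structure. The design problem is keeping both kinds of structure at once.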

The bottleneck is designing an energy or contrastive JEPA objective that simultaneously:

  • keeps slot structure intact
  • preserves 3D geometry
  • allows a dynamics module to predict futures
  • does not collapse to trivial “everything is one blob” or “only camera motion” representations
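One standard family of tools against the "one blob" and "only camera motion" collapses is a variance/covariance regularizer over the embeddings, in the spirit of VICReg-style objectives. A hedged numpy sketch (thresholds, dimensions, and the toy data are illustrative):

```python
import numpy as np

def variance_loss(z, eps=1e-4, target=1.0):
    """Penalize embedding dims whose std falls below a target: discourages
    all slots/frames collapsing to a single point ("one blob")."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, target - std))

def covariance_loss(z):
    """Penalize off-diagonal covariance: discourages all dims encoding the
    same factor (e.g. everything reduced to camera motion)."""
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag**2).sum() / z.shape[1]

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))                      # spread-out embeddings
collapsed = np.ones((256, 8)) + 1e-6 * rng.normal(size=(256, 8))

print(variance_loss(healthy) < variance_loss(collapsed))  # True
```

The open question is where to apply such terms in a slot-structured latent: regularizing per slot, per dimension, or across the whole scene vector trades off slot diversity against within-slot expressiveness differently.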

This raises two questions:

  • How do you define positive and negative pairs at the object level (slots) and the 3D level?
  • How do you avoid JEPA learning a representation that is too invariant (ignoring fine-grained physics) or too entangled (no clean slots)?
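One concrete answer to the first question is to define positives at the slot level: slot k at time t pairs with slot k at t+1, while the other slots in the same clip serve as negatives. A sketch of such an object-level InfoNCE loss (the matching of slot identities across frames is assumed here; in practice it requires Hungarian matching or a learned tracker):

```python
import numpy as np

def slot_info_nce(slots_t, slots_t1, temperature=0.1):
    """InfoNCE at the object level: slot k at t is positive with slot k at
    t+1; all other slots at t+1 are negatives. Assumes identities are
    already matched across frames (an assumption, not given for free)."""
    a = slots_t / np.linalg.norm(slots_t, axis=1, keepdims=True)
    b = slots_t1 / np.linalg.norm(slots_t1, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                 # (K, K) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # cross-entropy vs identity

rng = np.random.default_rng(0)
slots_t = rng.normal(size=(5, 16))
aligned = slots_t + 0.01 * rng.normal(size=(5, 16))  # same objects, small motion
shuffled = np.roll(slots_t, 1, axis=0)               # identities scrambled

print(slot_info_nce(slots_t, aligned) < slot_info_nce(slots_t, shuffled))  # True
```

The failure modes named in the questions above show up directly here: if the encoder is too invariant, aligned and shuffled pairs become indistinguishable; if it is too entangled, no per-slot similarity is meaningful in the first place.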

Objective design is the single biggest scientific bottleneck.