3D Judge

To reach SoTA, the system should:

  • outperform CLIP-style and image-only judge baselines,
  • outperform ImageReward-style preference predictors,
  • and beat the Eval3D evaluator on human correlation or pairwise agreement (agreement is sketched below).

The goal is a judge that measures real 3D structural correctness, not just 2D plausibility.
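
A minimal sketch of the pairwise-agreement computation, assuming judge and human scores arrive as asset_id -> float dicts (the function name and data layout are illustrative, not taken from any benchmark's released code):

```python
import itertools

def pairwise_agreement(judge_scores: dict, human_scores: dict) -> float:
    """Fraction of asset pairs that the judge orders the same way
    humans do. Pairs tied under either scorer are skipped."""
    agree, total = 0, 0
    for a, b in itertools.combinations(sorted(judge_scores), 2):
        j = judge_scores[a] - judge_scores[b]
        h = human_scores[a] - human_scores[b]
        if j == 0 or h == 0:  # skip ties: no ranking signal
            continue
        total += 1
        agree += (j > 0) == (h > 0)
    return agree / total if total else float("nan")
```

Running the same function over a CLIP-style baseline's scores and the candidate judge's scores on identical assets gives a like-for-like comparison against the target agreement range.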

Research Gap

  1. Janus and view inconsistency
  • 3D outputs can look plausible in any single view while failing across views.
  2. Data scarcity
  • Good 3D data is scarce, expensive to produce, and narrow in coverage.
  3. Benchmark weakness
  • Current metrics over-rely on image-text similarity and under-measure true 3D correctness.
  4. Compute and representation limits
  • 3D models remain expensive in tokens, compute, and representation overhead.
  5. Open limitations across the literature
  • Existing systems still lack robust structural verification of generated 3D assets.

Benchmarks

  1. Eval3D
  • Primary benchmark for a new 3D judge.
  • 160 prompts, generated 3D assets, and human ratings.
  • Main target: exceed the commonly cited 83% to 88% human-agreement range.
  2. T3-Bench
  • Secondary benchmark covering older multi-view image-text evaluators (a minimal scorer in that style is sketched after this list).
  3. Core grading axes (collected into a verdict schema sketched after this list)
  • Geometric consistency: detect floaters, noisy surfaces, and texture-geometry mismatch.
  • Structural consistency: detect Janus failures, duplicated parts, and impossible shape layouts.
  • Semantic consistency: detect objects whose identity changes across viewpoints.
  • Prompt alignment: detect missing prompt-critical details.
  4. 3D MM-Vet
  • Useful only if the judge is an MLLM.
  • Validates basic 3D spatial and visual grounding.
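
For context on what the older multi-view image-text evaluators measure, the sketch below averages per-view CLIP similarity over renders of one asset. It uses the standard Hugging Face transformers CLIP checkpoint; the rendering step and the function name multiview_clip_score are assumptions. A metric of this shape rewards per-view 2D plausibility but never compares views against each other, which is why it misses Janus and cross-view identity failures.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def multiview_clip_score(prompt: str, views: list[Image.Image]) -> float:
    """Mean CLIP image-text similarity over rendered views of one asset.

    `views` is assumed to be a list of renders from different camera
    angles; rendering happens outside this function.
    """
    inputs = processor(text=[prompt], images=views,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # average similarity over views
```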
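
To keep the four grading axes separable in the judge's output, one option is a per-axis verdict record that is aggregated only at the end. The schema below is a sketch: the 0-to-1 scale, field names, and equal default weights are assumptions, with the weights meant to be fit against Eval3D's human ratings.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """One score per grading axis, each assumed to lie in [0, 1]."""
    geometric: float   # floaters, noisy surfaces, texture-geometry mismatch
    structural: float  # Janus faces, duplicated parts, impossible layouts
    semantic: float    # object identity stable across viewpoints
    alignment: float   # prompt-critical details present

    def overall(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        """Weighted aggregate; equal weights are a placeholder."""
        axes = (self.geometric, self.structural, self.semantic, self.alignment)
        return sum(w * a for w, a in zip(weights, axes))
```

Reporting the per-axis scores alongside the aggregate keeps failure analysis straightforward, since each field maps one-to-one onto the grading axes listed above.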