JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

Xinnan Zhu1,2,* Ruijie Xu1,2,* Jiayu Ying1 Daoguo Dong3 Jiachen Xu4 Yuan Xie1 Xin Tan1,2,†

1East China Normal University    2Shanghai Artificial Intelligence Laboratory    3Fudan University    4Tencent

*Equal contribution. †Corresponding author.

Abstract

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

Method

JointEdit3D method overview
RGB-geometry latent inpainting with source-scene anchoring. The edited reference frame specifies the target edit, while the SceneAnchor Branch preserves 3D scene structure without forcing direct copying.
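The asymmetric observation pattern described above can be sketched as a layout over latent slots, together with a region-weighted loss. This is a minimal illustration only, not the authors' implementation: the names `LatentSlot`, `build_inpainting_layout`, and `region_weighted_l2`, the per-view geometry slots, and the weights `w_edit` and `w_bg` are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LatentSlot:
    modality: str   # "rgb" or "geometry"
    view: int       # view / frame index
    observed: bool  # True: given as conditioning; False: to be generated


def build_inpainting_layout(num_views: int, ref_view: int = 0) -> List[LatentSlot]:
    """Mark which latent slots are observed and which are generated.

    Asymmetric: only the edited RGB reference latent is observed. Every other
    RGB latent and every geometry latent is generated. One geometry slot per
    view is an assumption of this sketch.
    """
    if not 0 <= ref_view < num_views:
        raise ValueError("ref_view must index one of the views")
    slots = [LatentSlot("rgb", v, observed=(v == ref_view)) for v in range(num_views)]
    slots += [LatentSlot("geometry", v, observed=False) for v in range(num_views)]
    return slots


def region_weighted_l2(
    pred: Sequence[float],
    target: Sequence[float],
    edit_mask: Sequence[bool],
    w_edit: float = 2.0,  # hypothetical weight for edited-region elements
    w_bg: float = 1.0,    # hypothetical weight for background elements
) -> float:
    """Weighted mean squared error with separate edit/background weights."""
    num, den = 0.0, 0.0
    for p, t, m in zip(pred, target, edit_mask):
        w = w_edit if m else w_bg
        num += w * (p - t) ** 2
        den += w
    return num / den if den else 0.0
```

Raising `w_edit` relative to `w_bg` pushes such a loss toward edited-region fidelity, while the reverse favors preserving unedited content; the actual loss form and weighting used by JointEdit3D are not specified on this page.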

Dataset

SceneEdit3D provides paired scene-level edits with renderer-side 3D supervision for training and standardized evaluation.

SceneEdit3D-15K data generation pipeline
SceneEdit3D-15K starts from composable 3D indoor scenes, proposes canonical edits with a VLM, executes valid edits in Blender, and renders paired source/target videos with RGB, depth, camera, mask, edited-reference, and instruction supervision.
SceneEdit3D-15K: 15K paired editing samples covering scene-level object removal, movement, appearance edits, and mixed operations.
3D Annotations: Renderer-provided geometry, camera, mask, and auxiliary supervision support RGB-geometry editing and evaluation.
SceneEdit3D-Bench: A curated 100-sample benchmark for edit fidelity, background preservation, and 3D structural quality.
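To make the supervision signals listed above concrete, one paired sample could be modeled roughly as follows. This is a hypothetical record layout: the class name `EditSample`, every field name, and the operation labels (borrowed from the benchmark's operation categories) are assumptions for illustration and do not describe the released file format.

```python
from dataclasses import dataclass
from typing import List

# Operation labels assumed from the benchmark's categories.
EDIT_OPERATIONS = ("delete", "add", "move", "appearance", "multi-op")


@dataclass
class EditSample:
    instruction: str        # natural-language edit instruction
    operation: str          # one of EDIT_OPERATIONS
    source_rgb: List[str]   # per-frame paths of the source video
    target_rgb: List[str]   # per-frame paths of the edited video
    target_depth: List[str] # per-frame depth maps of the edited scene
    cameras: List[str]      # per-frame camera parameter files
    edit_masks: List[str]   # per-frame masks of the edited region
    edited_reference: str   # single edited reference frame

    def __post_init__(self) -> None:
        if self.operation not in EDIT_OPERATIONS:
            raise ValueError(f"unknown operation: {self.operation!r}")
        n = len(self.source_rgb)
        # All per-frame lists must be aligned with the source video.
        for name in ("target_rgb", "target_depth", "cameras", "edit_masks"):
            if len(getattr(self, name)) != n:
                raise ValueError(f"{name} must have {n} frames")

    @property
    def num_frames(self) -> int:
        return len(self.source_rgb)
```

Validating frame alignment at construction time keeps source/target pairs consistent before they reach training or evaluation code.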

Dataset Samples

Each sample pairs source/edited videos with semi-transparent mask overlays, the example edit prompt, and source/edited 3D point-cloud previews.

Quantitative Results

We evaluate edit fidelity and background preservation with region-aware PSNR/LPIPS, and measure 3D structure against renderer-provided edited geometry.
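A region-aware PSNR of this kind can be sketched by restricting the squared error to pixels inside the edit mask (edit fidelity) or outside it (background preservation). The function name `masked_psnr`, the flattened-pixel input, and the `max_val=1.0` intensity range are assumptions for this sketch; the exact protocol used in the paper may differ. LPIPS requires a pretrained network and is not covered here.

```python
import math
from typing import Sequence


def masked_psnr(
    pred: Sequence[float],
    target: Sequence[float],
    edit_mask: Sequence[bool],
    region: str = "edit",
    max_val: float = 1.0,
) -> float:
    """PSNR over the edited region ("edit") or its complement ("bg")."""
    if region not in ("edit", "bg"):
        raise ValueError("region must be 'edit' or 'bg'")
    want_edit = region == "edit"
    sq_err, count = 0.0, 0
    for p, t, m in zip(pred, target, edit_mask):
        if bool(m) == want_edit:
            sq_err += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("selected region contains no pixels")
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # identical pixels in the selected region
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For example, with `pred = [0.5, 0.0, 0.0, 0.0]`, an all-zero `target`, and only the first pixel masked as edited, the edit-region MSE is 0.25, giving about 6.02 dB, while the background region matches exactly and returns infinity.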


Runtime-quality comparison plot
JointEdit3D gives a favorable quality-efficiency trade-off in the 49-frame full-3D protocol.

Delete Edit Region

Values are reported as PSNR ↑ / LPIPS ↓.

Method | SceneEdit3D-Bench (PSNR ↑ / LPIPS ↓) | 360-USID (PSNR ↑ / LPIPS ↓)
SPIn-NeRF | 19.20 / 0.528 | 15.89 / 0.4826
Gaussian Grouping | 20.41 / 0.377 | 16.67 / 0.4241
GScream | 20.38 / 0.449 | 14.37 / 0.5725
GaussianEditor | 20.26 / 0.451 | 16.23 / 0.3975
MVInpainter | 21.37 / 0.406 | 15.60 / 0.5677
Omni-3DEdit | 24.86 / 0.307 | 17.78 / 0.3655
JointEdit3D | 31.92 / 0.151 | 18.57 / 0.3426

3D Geometry Quality

Point-cloud metrics against renderer-provided edited geometry.

Method | 3D Source | Accuracy ↓ | Completeness ↓ | CD ↓ | F-score ↑
GT RGB + VGGT | VGGT recon. | 0.0724 | 0.0950 | 0.0837 | 0.9619
SEVA + VGGT | VGGT recon. | 0.1433 | 0.2244 | 0.1839 | 0.8722
Omni-3DEdit + VGGT | VGGT recon. | 0.1165 | 0.1313 | 0.1239 | 0.9143
JointEdit3D RGB + VGGT | VGGT recon. | 0.0891 | 0.1211 | 0.1051 | 0.9357
JointEdit3D | Joint output | 0.1153 | 0.0886 | 0.1020 | 0.9702
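The four point-cloud metrics in this table are commonly defined through nearest-neighbor distances between the predicted and reference clouds, and the sketch below follows that common definition. It is an assumption that the paper uses exactly these forms. Each reported CD value does equal the mean of Accuracy and Completeness in its row (for instance, (0.0724 + 0.0950) / 2 = 0.0837), which the sketch mirrors. The function name `point_cloud_metrics` and the F-score threshold `tau` are hypothetical, and the brute-force search is only meant for small clouds.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_dists(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def point_cloud_metrics(
    pred: Sequence[Point], gt: Sequence[Point], tau: float = 0.05
) -> Dict[str, float]:
    if not pred or not gt:
        raise ValueError("both point clouds must be non-empty")
    d_pred = _nearest_dists(pred, gt)  # predicted -> reference
    d_gt = _nearest_dists(gt, pred)    # reference -> predicted
    accuracy = sum(d_pred) / len(d_pred)
    completeness = sum(d_gt) / len(d_gt)
    chamfer = 0.5 * (accuracy + completeness)
    # F-score: harmonic mean of precision and recall at distance threshold tau.
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return {
        "accuracy": accuracy,
        "completeness": completeness,
        "cd": chamfer,
        "f_score": f_score,
    }
```

Under these definitions, a prediction that is precise but misses part of the scene scores well on Accuracy and poorly on Completeness, which is why the table reports both alongside CD.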

Operation-wise Region Metrics

Average PSNR / LPIPS over Delete, Add, Move, Appearance, and Multi-op on SceneEdit3D-Bench.

Region | Method | Delete | Add | Move | Appearance | Multi-op | Average
Bg. | SEVA | 17.89 / 0.421 | 17.74 / 0.426 | 17.69 / 0.411 | 19.00 / 0.378 | 17.54 / 0.445 | 17.93 / 0.418
Bg. | MVInpainter | 33.57 / 0.080 | 33.64 / 0.076 | 30.97 / 0.081 | 33.76 / 0.073 | 30.06 / 0.091 | 32.40 / 0.080
Bg. | Omni-3DEdit | 25.00 / 0.215 | 24.24 / 0.217 | 24.40 / 0.210 | 25.88 / 0.190 | 23.81 / 0.235 | 24.65 / 0.214
Bg. | JointEdit3D | 32.62 / 0.079 | 32.59 / 0.081 | 31.22 / 0.069 | 33.80 / 0.059 | 30.84 / 0.095 | 32.33 / 0.077
Edit | SEVA | 23.39 / 0.399 | 15.92 / 0.523 | 17.87 / 0.450 | 17.50 / 0.464 | 16.93 / 0.496 | 18.72 / 0.465
Edit | MVInpainter | 21.37 / 0.406 | 20.04 / 0.362 | 16.99 / 0.437 | 18.94 / 0.349 | 17.92 / 0.423 | 19.05 / 0.392
Edit | Omni-3DEdit | 24.86 / 0.307 | 20.63 / 0.293 | 18.55 / 0.377 | 20.92 / 0.230 | 17.03 / 0.461 | 21.10 / 0.323
Edit | JointEdit3D | 31.92 / 0.151 | 23.88 / 0.195 | 23.19 / 0.244 | 27.67 / 0.115 | 23.77 / 0.263 | 26.63 / 0.187

Citation

@article{zhu2026jointedit3d,
  title={JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space},
  author={Zhu, Xinnan and Xu, Ruijie and Ying, Jiayu and Dong, Daoguo and Xu, Jiachen and Xie, Yuan and Tan, Xin},
  journal={arXiv preprint},
  year={2026}
}