Achieving Content Consistency in Parallel Multi-Trajectory Camera-Controlled Video Generation
Camera-controlled video generation is valuable for applications ranging from visual design to providing 2D supervision for 4D generation tasks. However, existing approaches are limited to single-trajectory generation, forcing users to process multiple trajectories in separate batches. This serial inference introduces content inconsistencies across viewpoints due to the inherent randomness of diffusion models. Explicit point cloud methods can only partially address this problem, as single-viewpoint back-projection suffers from sparsity and depth estimation errors. We propose CameraSquad, a multi-trajectory camera control framework that supports both single-trajectory and parallel multi-trajectory generation. Our method achieves precise camera control while preserving input video content through decoupled content and camera control mechanisms. To ensure viewpoint consistency in multi-trajectory mode, we design a dual-mode cross-view attention mechanism that maintains consistency across parallel trajectories while guaranteeing camera control precision. Extensive experiments demonstrate that CameraSquad achieves competitive performance in camera control accuracy, consistency maintenance, and generation quality compared to existing approaches.
Figure 2: Overview of CameraSquad. (a) The self-attention layers are adapted into content attention, and PRoPE is employed to provide fundamental single-trajectory camera control. (b) A Dual-Mode Cross-View Attention mechanism comprising CVA-α for subject consistency and CVA-β for relative perspective accuracy.
We repurpose DiT's original 3D self-attention as Content-Attention to preserve video content, and build a separate Camera-Attention pathway using the PRoPE mechanism. PRoPE encodes camera intrinsics and extrinsics into the transformation matrices applied to the attention Q/K/V via projection geometry, enabling precise camera control without compressing the camera parameters into 1D values.
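The idea of folding camera geometry into attention can be illustrated with a minimal sketch. It is not the PRoPE implementation: it assumes each camera is summarized by a 4x4 matrix built from its intrinsics and world-to-camera extrinsics, and that a query is transformed by that matrix's transpose while a key is transformed by its inverse. The function names (projection_matrix, change_world_frame, camera_aware_logit) and the 4-dimensional toy features are hypothetical. Under these assumptions the attention logit depends only on the relative transform between the two cameras, so it is unchanged when the world coordinate frame is moved.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Lift 3x3 intrinsics K and a world-to-camera pose (R, t) into one 4x4 matrix."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return K4 @ T

def change_world_frame(R, t, Rg, tg):
    """Re-express a world-to-camera pose after moving the world frame by (Rg, tg)."""
    return R @ Rg, R @ tg + t

def camera_aware_logit(q, k, P_q, P_k):
    """Logit q^T (P_q P_k^{-1}) k, obtained by transforming q and k separately.

    q and k are toy 4-vectors (an assumption for illustration); real features
    would be split into many such 4-dimensional chunks.
    """
    q_t = P_q.T @ q                 # query carries its own camera
    k_t = np.linalg.inv(P_k) @ k    # key carries the inverse of its camera
    return float(q_t @ k_t)
```

Because only the product P_q P_k^{-1} enters the logit, applying the same change_world_frame to both cameras leaves the result unchanged, which is the property that makes such an encoding relative rather than absolute.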
CVA-α ensures content consistency: reference video tokens provide K/V while each trajectory's noisy tokens serve as Q, making tokens from the same frame across different views mutually visible. CVA-β ensures geometric consistency: it repurposes PRoPE spatial attention to operate along the view dimension instead of the frame dimension, enabling multi-view geometric supervision.
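The two modes can be pictured with a small NumPy sketch. It rests on assumptions rather than the released model: tokens are stored as (views, frames, tokens per frame, channels), attention is single-head with no learned projections, and "along the view dimension" is read as regrouping tokens so that one attention sequence covers all views of the same frame. The names cva_alpha and group_along_views are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cva_alpha(noisy, ref):
    """Cross-attention from trajectory tokens (Q) to reference tokens (K/V).

    noisy: (V, F, N, D) noisy tokens of V parallel trajectories.
    ref:   (F, M, D) reference-video tokens.
    Every trajectory token attends only to reference tokens of the same frame,
    so all views read from one shared content source.
    """
    d = noisy.shape[-1]
    logits = np.einsum("vfnd,fmd->vfnm", noisy, ref) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    return np.einsum("vfnm,fmd->vfnd", weights, ref)

def group_along_views(tokens):
    """(V, F, N, D) -> (F, V*N, D): one sequence per frame spanning all views.

    A spatial attention layer applied to this layout mixes tokens across views
    of the same frame instead of within a single view.
    """
    V, F, N, D = tokens.shape
    return tokens.transpose(1, 0, 2, 3).reshape(F, V * N, D)
```

In this sketch the output of cva_alpha keeps the (V, F, N, D) layout of the queries, so it can be added back to each trajectory as a residual, while group_along_views only changes which tokens share a sequence and leaves the values untouched.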
Using Depth Anything 3 (DA3) for multi-view depth estimation, CameraSquad back-projects content-consistent multi-view frames into dynamic point clouds. The resulting point clouds are larger, finer, and more complete than those produced by single-view methods, providing high-quality 3D world states for downstream 4D reconstruction.
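The back-projection step follows the standard pinhole camera model and can be sketched as below. This is an illustration under stated assumptions, not the project's pipeline: depth maps are plain arrays of z-depth (no DA3 API is called), each view has a 3x3 intrinsics matrix and a 4x4 camera-to-world pose, and the function names backproject_depth and fuse_views are hypothetical.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Back-project one depth map into world-space points.

    depth:        (H, W) z-depth per pixel.
    K:            (3, 3) pinhole intrinsics.
    cam_to_world: (4, 4) camera-to-world pose.
    Returns an (H*W, 3) array of points, ordered row by row.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                           # camera rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)                     # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                    # move to world frame

def fuse_views(depths, intrinsics, poses):
    """Concatenate the back-projected points of several views into one cloud."""
    clouds = [backproject_depth(d, K, T)
              for d, K, T in zip(depths, intrinsics, poses)]
    return np.concatenate(clouds, axis=0)
```

Running fuse_views once per time step over all generated views would give one point cloud per frame, which is one way to read "dynamic point clouds"; how views are filtered or merged in practice is not specified here.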
Figure 3: Qualitative comparison of single-trajectory video synthesis. CameraSquad achieves competitive generation quality compared to existing state-of-the-art camera control methods.
Figure 4: Qualitative comparison of multi-view point cloud reprojection. CameraSquad preserves pixels across varying perspectives, while baseline methods exhibit varying degrees of loss in background or subject regions.
Figure 7: Visualization of 6-trajectory generation results. CameraSquad enables synchronous generation of up to 6 trajectories, consistently producing stable outputs for both human and landscape videos.
Figure 6: Qualitative comparison of multi-trajectory video synthesis with existing state-of-the-art camera control methods.
Figure 8: Additional qualitative comparisons of single-trajectory video synthesis.
Demo videos will be added soon.
@article{xu2026camerasquad,
  title     = {CameraSquad: Achieving Content Consistency in Parallel Multi-Trajectory Camera-Controlled Video Generation},
  author    = {Xu, Zhufeng and Gao, Xuan and Deng, Bailin and Ding, Yikang and Liu, Xiaoqiang and Zhang, Haoxian and Wan, Pengfei and Fu, Hongbo and Gao, Lin},
  journal   = {ACM Transactions on Graphics},
  year      = {2026},
  publisher = {ACM}
}