
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

This work introduces 4DGen, a novel holistic framework for grounded 4D content creation that decomposes the 4D generation task into multiple stages. We identify static 3D assets and monocular video sequences as key components in constructing the 4D content.


Aided by text-to-image and text-to-video diffusion models, existing 4D content creation pipelines utilize score distillation sampling to optimize the entire dynamic 3D scene. However, as these pipelines generate 4D content from text or image inputs directly, they are constrained by limited motion capabilities and depend on unreliable prompt engineering for desired results.


To address these problems, this work introduces 4DGen, a novel framework for grounded 4D content creation.
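For context, the sketch below illustrates the score distillation sampling (SDS) update these pipelines rely on, in PyTorch. It is a minimal illustration under assumed names (`unet`, `alphas_cumprod`, `text_emb`, `render`), not 4DGen's actual code: a frozen diffusion teacher scores a noised rendering, and the residual between its noise prediction and the injected noise becomes a gradient for the scene parameters.

```python
import torch

def sds_grad(unet, alphas_cumprod, rendered_rgb, text_emb,
             num_train_timesteps=1000):
    """Surrogate gradient for a rendered image of shape (B, 3, H, W) in [0, 1]."""
    x = rendered_rgb.detach() * 2.0 - 1.0              # diffusion models expect [-1, 1]
    t = torch.randint(20, num_train_timesteps,
                      (x.shape[0],), device=x.device)  # random noise level per sample
    noise = torch.randn_like(x)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * noise  # forward diffusion q(x_t | x_0)
    with torch.no_grad():                              # the teacher stays frozen
        eps_pred = unet(x_t, t, text_emb)              # predicted noise
    return (1.0 - a_t) * (eps_pred - noise)            # w(t) * (eps_hat - eps)

# Usage: push the gradient through the differentiable renderer's output.
# rendered = render(scene_params, camera)              # hypothetical renderer
# rendered.backward(gradient=sds_grad(unet, alphas_cumprod, rendered, text_emb))
```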


To address the aforementioned challenges, we introduce 4DGen, a novel pipeline tackling a new task of Grounded 4D Generation, which focuses on video-to-4D generation. As shown in Fig. 2, our primary strategy involves using monocular videos as conditional inputs to provide users with precise control over both the motion and appearance of generated 4D content.


To enhance the fidelity of reconstruction, spatial-temporal pseudo labels on anchor frames are employed through a multi-view diffusion model. The introduction of seamless consistency priors, achieved via score distillation sampling and unsupervised smoothness regularizations, further refines the 4D generation process.
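The exact form of the smoothness term is not spelled out here, but unsupervised temporal regularizers are commonly finite-difference penalties over consecutive frames. A minimal sketch of that idea follows; the tensor layout and the loss weight are assumptions, not the paper's exact formulation.

```python
import torch

def temporal_smoothness(frames: torch.Tensor) -> torch.Tensor:
    """L2 penalty on frame-to-frame change.

    frames: (T, C, H, W) tensor of consecutive renderings or
            per-frame deformation fields (assumed layout).
    """
    diff = frames[1:] - frames[:-1]   # finite difference along the time axis
    return diff.pow(2).mean()

# Example: add to the total objective with a small weight (value illustrative).
# loss = loss_recon + loss_sds + 0.01 * temporal_smoothness(rendered_frames)
```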


Data Preparation

We release our collected data on Google Drive. Some of these data are user-specified, while others are generated. Each test case contains two folders, `name_pose0` and `name_sync`: `pose0` refers to the monocular video sequence, and `sync` refers to the pseudo labels generated by SyncDreamer. We recommend using Practical-RIFE if you need to introduce more frames in your video sequence. To preprocess your own images into RGBA format, see the sketch below.
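The repository presumably ships its own preprocessing script; as a stand-in, this sketch strips backgrounds with the rembg package (`pip install rembg`) and saves RGBA frames. The paths and file pattern are illustrative assumptions.

```python
from pathlib import Path

from PIL import Image
from rembg import remove  # pip install rembg

def to_rgba(src_dir: str, dst_dir: str) -> None:
    """Convert every PNG in src_dir into a background-free RGBA image."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.png")):
        rgba = remove(Image.open(img_path))  # rembg returns an RGBA PIL image
        rgba.save(dst / img_path.name)

to_rgba("my_frames", "my_frames_rgba")  # hypothetical input/output folders
```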


Figure 1: 4DGen introduces grounded 4D content generation. We present high-quality rendered images from diverse viewpoints at distinct timesteps. Our model exhibits rapid training and high-quality results with outstanding spatial-temporal consistency.

¹Equal Contribution. ²Corresponding Author.

1 Introduction

Figure 2: Previous work generates 4D content in one

4DGen is a novel holistic framework for grounded 4D content creation that decomposes the 4D generation task into multiple stages and supports grounded generation, offering users enhanced control, a feature difficult to achieve with previous methods.