3D Space as a Scratchpad for Editable Text-to-Image Generation

1 University of Massachusetts Amherst
2 Adobe Research
* Equal advising
3D Scratchpad Teaser

Spatial scratchpads enable structured 3D reasoning for controllable image generation. Our method constructs a 3D spatial scratchpad from a text prompt, representing subjects as editable meshes with explicit geometry and spatial relations. Users or language agents can manipulate this scene — moving, resizing, or reorienting objects — through either text-based or interactive edits. The updated 3D configuration is re-rendered via a 3D-guided text-to-image pipeline, producing identity-preserving and spatially coherent images that remain faithful to the original prompt. This demonstrates how 3D reasoning serves as an effective intermediate workspace linking linguistic intent and precise visual synthesis.

Abstract

Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet vision–language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad — a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision–language models that deliberate not only in language, but also in space.

Method

Overview of a 3D spatial scratchpad
Overview of a 3D spatial scratchpad. Given an input prompt P, we illustrate how our method uses 3D space as an underlying representation to generate an image with superior alignment to the prompt. Agent 1 decomposes the input prompt into subjects and background. Agent 2 provides a 3D bounding box for each subject. We render the scratchpad and generate an image based on these placements, which is then given to Agent 3, which adjusts the transformations of the meshes. Finally, Agent 4 chooses the best camera viewpoint from a set of proposals to generate the final image.
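The four-agent loop above can be sketched as plain control flow over an explicit scene state. The sketch below is illustrative only: every name (`SceneObject`, `decompose_prompt`, `propose_boxes`, `refine_transforms`, `choose_viewpoint`) is a hypothetical stand-in, and each agent is stubbed with a deterministic rule where the real system would issue an LLM/VLM call and render the scratchpad between steps.

```python
# Hypothetical sketch of the four-agent scratchpad pipeline; all names are
# illustrative, not the paper's actual API. Agents are stubbed with simple
# rules in place of LLM/VLM calls.
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    position: tuple = (0.0, 0.0, 0.0)  # 3D bounding-box center
    size: tuple = (1.0, 1.0, 1.0)      # box extents
    yaw: float = 0.0                   # orientation about the up axis, degrees


def decompose_prompt(prompt: str):
    """Agent 1 (stub): split the prompt into subjects and a background phrase."""
    vocabulary = ("cat", "dog", "chair")
    subjects = [w for w in vocabulary if w in prompt]
    return subjects, "background"


def propose_boxes(subjects):
    """Agent 2 (stub): place one 3D bounding box per subject along x."""
    return [SceneObject(s, position=(float(i), 0.0, 0.0))
            for i, s in enumerate(subjects)]


def refine_transforms(scene):
    """Agent 3 (stub): inspect a render and nudge object transforms."""
    for obj in scene:
        obj.yaw += 15.0  # e.g. turn subjects toward the camera
    return scene


def choose_viewpoint(scene, proposals):
    """Agent 4 (stub): pick the best camera from a set of proposals."""
    return proposals[0]


def build_scratchpad(prompt: str):
    subjects, background = decompose_prompt(prompt)
    scene = propose_boxes(subjects)           # initial placements
    scene = refine_transforms(scene)          # feedback pass on the render
    camera = choose_viewpoint(scene, ["front", "three-quarter", "top-down"])
    return scene, background, camera


scene, background, camera = build_scratchpad("a cat next to a dog")
print([o.name for o in scene], camera)
```

The key design point the sketch mirrors is that every agent reads and writes the same explicit 3D state, so each intermediate decision (placement, orientation, viewpoint) remains inspectable and editable before the final image is generated.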

Interactive 3D Scratchpad Visualization

[Interactive viewer: drag to rotate, scroll to zoom. Each example shows the 3D scratchpad alongside the rendered image (chosen proposal view) and the generated image (identity- and structure-controlled).]
Consistent Editability

Consistent editability results
3D as a spatial scratchpad enables consistent image editing. Given the input prompt, the first column shows the instantiated subjects, the constructed scratchpad, and the resulting generated image. The succeeding columns show how edits to the scratchpad, made either through (a) manual editing or (b) text-based editing, are consistently reflected in the final image while preserving the identities of the subjects and the background. Each column shows a progressive edit applied to the state in the previous column.
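Because the scratchpad holds explicit 3D state, an edit of either kind reduces to a transform update on a named object, after which re-rendering propagates the change to the image. The sketch below illustrates this under assumed names (`Scratchpad`, `Obj`, `move`, `resize`); it is not the paper's API, and the re-rendering step is omitted.

```python
# Illustrative sketch of scratchpad editing: both manual drags and parsed
# text edits become transform updates on named scene objects. Names are
# hypothetical, not the paper's API.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    name: str
    pos: tuple
    scale: float = 1.0


class Scratchpad:
    def __init__(self, objects):
        # Index objects by name so edits can address subjects directly.
        self.objects = {o.name: o for o in objects}

    def move(self, name, dx=0.0, dy=0.0, dz=0.0):
        """Translate an object; e.g. a manual drag in the 3D viewer."""
        o = self.objects[name]
        x, y, z = o.pos
        self.objects[name] = replace(o, pos=(x + dx, y + dy, z + dz))

    def resize(self, name, factor):
        """Scale an object; e.g. a parsed text edit like 'make the cat bigger'."""
        o = self.objects[name]
        self.objects[name] = replace(o, scale=o.scale * factor)


pad = Scratchpad([Obj("cat", (0.0, 0.0, 0.0)), Obj("dog", (2.0, 0.0, 0.0))])
pad.move("dog", dx=-1.0)   # manual edit: drag the dog closer to the cat
pad.resize("cat", 1.5)     # text-based edit applied on top of the previous state
print(pad.objects["dog"].pos, pad.objects["cat"].scale)
```

Each call mutates the same scene state, mirroring the figure's progressive edits: every column's edit is applied over the state left by the previous one, and identity is preserved because the underlying meshes are never regenerated, only transformed.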