3D Space as a Scratchpad for Editable Text-to-Image Generation

1 University of Massachusetts Amherst
2 Adobe Research
* Equal advising
3D Scratchpad Teaser

Spatial scratchpads enable structured 3D reasoning for controllable image generation. Our method constructs a 3D spatial scratchpad from a text prompt, representing subjects as editable meshes with explicit geometry and spatial relations. Users or language agents can manipulate this scene — moving, resizing, or reorienting objects — through either text-based or interactive edits. The updated 3D configuration is re-rendered via a 3D-guided text-to-image pipeline, producing identity-preserving and spatially coherent images that remain faithful to the original prompt. This demonstrates how 3D reasoning serves as an effective intermediate workspace linking linguistic intent and precise visual synthesis.

Abstract

Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet vision–language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad — a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision–language models that deliberate not only in language, but also in space.

Method

Overview of a 3D spatial scratchpad
Overview of a 3D spatial scratchpad. Given an input prompt P, we illustrate how our method uses 3D space as an underlying representation to generate an image with superior alignment to the prompt. Agent 1 decomposes the input prompt into subjects and background. Agent 2 provides a 3D bounding box for each subject. We render the scratchpad and generate an image based on these placements, which is then given to Agent 3, which adjusts the transformations of the meshes. Finally, Agent 4 chooses the best camera viewpoint from a set of proposals to generate the final image.
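The four-agent loop above can be sketched as plain control flow over an explicit scene state. The sketch below is illustrative only: every name (`SceneObject`, `decompose_prompt`, `propose_boxes`, `refine_transforms`, `choose_viewpoint`) is a hypothetical stand-in, and each agent is stubbed with a deterministic rule where the real system would issue an LLM/VLM call and render the scratchpad between steps.

```python
# Hypothetical sketch of the four-agent scratchpad pipeline; all names are
# illustrative, not the paper's actual API. Agents are stubbed with simple
# rules in place of LLM/VLM calls.
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    position: tuple = (0.0, 0.0, 0.0)  # 3D bounding-box center
    size: tuple = (1.0, 1.0, 1.0)      # box extents
    yaw: float = 0.0                   # orientation about the up axis, degrees


def decompose_prompt(prompt: str):
    """Agent 1 (stub): split the prompt into subjects and a background phrase."""
    vocabulary = ("cat", "dog", "chair")
    subjects = [w for w in vocabulary if w in prompt]
    return subjects, "background"


def propose_boxes(subjects):
    """Agent 2 (stub): place one 3D bounding box per subject along x."""
    return [SceneObject(s, position=(float(i), 0.0, 0.0))
            for i, s in enumerate(subjects)]


def refine_transforms(scene):
    """Agent 3 (stub): inspect a render and nudge object transforms."""
    for obj in scene:
        obj.yaw += 15.0  # e.g. turn subjects toward the camera
    return scene


def choose_viewpoint(scene, proposals):
    """Agent 4 (stub): pick the best camera from a set of proposals."""
    return proposals[0]


def build_scratchpad(prompt: str):
    subjects, background = decompose_prompt(prompt)
    scene = propose_boxes(subjects)           # initial placements
    scene = refine_transforms(scene)          # feedback pass on the render
    camera = choose_viewpoint(scene, ["front", "three-quarter", "top-down"])
    return scene, background, camera


scene, background, camera = build_scratchpad("a cat next to a dog")
print([o.name for o in scene], camera)
```

The key design point the sketch mirrors is that every agent reads and writes the same explicit 3D state, so each intermediate decision (placement, orientation, viewpoint) remains inspectable and editable before the final image is generated.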

Interactive 3D Scratchpad Visualization

[Interactive viewer: drag to rotate, scroll to zoom. Each example shows the 3D scratchpad alongside the rendered image (chosen proposal view) and the generated image (identity- and structure-controlled).]
Consistent Editability

Consistent editability results
3D as a spatial scratchpad enables consistent image editing. Given the input prompt, the first column shows the instantiated subjects, the constructed scratchpad, and the resulting generated image. The succeeding columns show how edits to the scratchpad, made either through (a) manual editing or (b) text-based editing, are consistently reflected in the final image while preserving the identities of the subjects and the background. Each column shows a progressive edit applied to the state in the previous column.
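Because the scratchpad holds explicit 3D state, an edit of either kind reduces to a transform update on a named object, after which re-rendering propagates the change to the image. The sketch below illustrates this under assumed names (`Scratchpad`, `Obj`, `move`, `resize`); it is not the paper's API, and the re-rendering step is omitted.

```python
# Illustrative sketch of scratchpad editing: both manual drags and parsed
# text edits become transform updates on named scene objects. Names are
# hypothetical, not the paper's API.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    name: str
    pos: tuple
    scale: float = 1.0


class Scratchpad:
    def __init__(self, objects):
        # Index objects by name so edits can address subjects directly.
        self.objects = {o.name: o for o in objects}

    def move(self, name, dx=0.0, dy=0.0, dz=0.0):
        """Translate an object; e.g. a manual drag in the 3D viewer."""
        o = self.objects[name]
        x, y, z = o.pos
        self.objects[name] = replace(o, pos=(x + dx, y + dy, z + dz))

    def resize(self, name, factor):
        """Scale an object; e.g. a parsed text edit like 'make the cat bigger'."""
        o = self.objects[name]
        self.objects[name] = replace(o, scale=o.scale * factor)


pad = Scratchpad([Obj("cat", (0.0, 0.0, 0.0)), Obj("dog", (2.0, 0.0, 0.0))])
pad.move("dog", dx=-1.0)   # manual edit: drag the dog closer to the cat
pad.resize("cat", 1.5)     # text-based edit applied on top of the previous state
print(pad.objects["dog"].pos, pad.objects["cat"].scale)
```

Each call mutates the same scene state, mirroring the figure's progressive edits: every column's edit is applied over the state left by the previous one, and identity is preserved because the underlying meshes are never regenerated, only transformed.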