
Promptable Game Models: Text-guided Game Simulation via Masked Diffusion Models

Published: 03 January 2024


Abstract

Neural video game simulators emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment’s state driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a set of natural language actions and desired states. The result—a Promptable Game Model (PGM)—makes it possible for a user to play the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the director’s mode, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning “game AI,” encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state-of-the-art. Our framework, data, and models are available at snap-research.github.io/promptable-game-models.


1 INTRODUCTION

Recent video generation methods, thanks to their training on extensive web-scale datasets [Schuhmann et al. 2022], exhibit a remarkable capacity for generating a wide variety of concepts and scenes [Ho et al. 2022a; Singer et al. 2022; Blattmann et al. 2023]. Despite this, their generic nature hinders their comprehension of the dynamics of the modeled scenes. When generating or editing videos of a game, such as a tennis match, this limitation impedes their ability to attain precise control of the player movements or to devise optimal strategies to reach desired states of the game, such as victory over the opponent.

Neural video game simulators, a growing category of video generation methods, make an important step in this direction by focusing on modeling the dynamics of an environment, often a sports or computer game, with high fidelity and degree of control, and show that annotated videos can be used to learn to generate videos interactively [Kim et al. 2020, 2021; Menapace et al. 2021; Huang et al. 2022; Davtyan and Favaro 2022] and build 3D environments where agents can be controlled through a set of discrete actions [Menapace et al. 2022]. However, when applied to complex or real-world environments, these works present several limitations: They do not accurately model the game’s dynamics, do not model physical interactions of objects in 3D space, do not learn precise controls, do not allow for high-level goal-driven control of the game flow, and, finally, do not model intelligent behavior of the agents, a capability often referred to as “game AI.”

In this work, we overcome these limitations by introducing game models trained on a set of annotated videos that support complex prompts. Due to the versatility of the applications enabled by diverse prompting methods (see Section 4), we call them Promptable Game Models (PGMs). More formally, we define PGMs as models supporting a core set of game modeling and prompting functions including rendering from a controllable viewpoint, modeling of the game’s dynamics, precise character control, high-level goal-driven control of the game, and game AI. Making a first step towards the realization of such models, we propose a framework that supports these characteristics.

To overcome the limitations of Davtyan and Favaro [2022], Huang et al. [2022], Kim et al. [2020, 2021], and Menapace et al. [2021, 2022], not only do we model the states of an environment, but we also consider detailed textual representations of the actions taking place in it. We argue that training on user commentaries describing detailed actions of a game greatly facilitates learning the dynamics of the game and game AI—important parts of PGMs—and that such commentaries are a key component in enabling a series of important model capabilities related to precise character control and high-level goal-driven control of the game flow.

In its simplest form, for games like tennis, this enables controlling each player in a precise manner with instructions such as “hit the ball with a backhand and send it to the right service box.” Moreover, language enables users to take the director’s mode and prompt the model with high-level game-specific scenarios or scripts, specified by means of natural language and desired states of the environment. As an example, given desired starting and ending states, our promptable game model can devise in-between scenarios that lead to the specified outcome. Most interestingly, as shown in Figure 1, given the initial states of a real tennis video in which a player lost a point, our model prompted by the command “the [other] player does not catch the ball” can perform the necessary action to win the point.

Fig. 1.

Fig. 1. We propose Promptable Game Models (PGMs), controllable models of games that are learned from annotated videos. Our PGM enables the generation of videos using prompts: a wide spectrum of conditioning signals such as player poses, object locations, and detailed textual actions indicating what each player should do. Our Animation Model uses this information to generate future, past, or interpolated environment states according to the learned game dynamics. At this stage, the model is able to perform complex action reasoning such as generating a winning shot if the action “the [other] player does not catch the ball” is specified, as shown in the figure. To accomplish this goal, the model decides that the bottom player should hit the ball with a “lob” shot, sending the ball high above the opponent, who is unable to catch it. Our model renders the scene from a user-defined viewpoint using a Synthesis Model, where the style of the scene can be controlled explicitly.

Broadly speaking, a game maintains states of its environments [Stanton et al. 2016; Starke et al. 2019; Curtis et al. 2022], renders them using a controllable camera, and evolves them according to user commands, actions of non-playable characters controlled by the game AI, and the game’s dynamics. Our framework follows this high-level structure highlighted in Figure 1. Our synthesis model maintains a state for every object and agent included in the game and renders them in the image space using the compositional NeRF of Menapace et al. [2022] followed by a learnable enhancer for superior rendering quality. To model the dynamics of games and game AI that determine the evolution of the environment states, we introduce an animation model. Specifically, inspired by Han et al. [2022], we train a non-autoregressive text-conditioned diffusion model that leverages masked sequence modeling to express the conditioning signals corresponding to a prompt. In particular, we show that using text labels describing actions happening in a game is instrumental in learning such capabilities. While certain prior work [Kim et al. 2020, 2021; Menapace et al. 2021, 2022] explored maintaining and rendering states of games, we are not aware of any generative method that attempts to enable precise control, modeling sophisticated goal-driven game dynamics, and learning game AI to the extent explored in this article.

The task of playing games and manipulating videos in the director’s mode has not been previously introduced in the literature. With this work, we attempt to introduce the task and set up a solid framework for future research. To do that, we collected two monocular video datasets. The first one is the Minecraft dataset containing 1.2 hours of videos, depicting a player moving in a complex environment. The second is a large-scale real-world dataset with 15.5 hours of high-resolution professional tennis matches. For each frame in these datasets, we provide accurate camera calibration, 3D player poses, ball localization, and, most importantly, diverse and rich text descriptions of the actions performed by each player in each frame.

In summary, our work brings the following contributions:

A framework for the creation of Promptable Game Models. It supports detailed offline rendering of high-resolution, high-frame rate videos of scenes with articulated objects from controllable viewpoints. It can generate actions specified by detailed text prompts, model opponents, and perform goal-driven generation of complex action sequences. As far as we are aware, no existing work provides this set of capabilities under comparable data assumptions.

A synthesis model, based on a compositional NeRF backed by an efficient plane- and voxel-based object representation that operates without upsampling. With respect to the upsampler-based approach of Menapace et al. [2022], it doubles the output resolution, can synthesize small objects, and does not present checkerboard upsampling artifacts.

An animation model, based on a text-conditioned diffusion model with a masked training procedure, which is key to supporting complex game dynamics, object interactions, game AI, and understanding detailed actions. It unlocks applications currently out of reach of state-of-the-art neural video game simulators (see Section 4).

Large-scale 15 h Tennis and 1 h Minecraft video datasets with camera calibration, 3D player poses, 3D ball localization, and detailed text captions.


2 RELATED WORK

Our Promptable Game Model relates to neural game simulation literature, game engines, character animation, neural rendering, sequential data generation, and text-based generation. We review the most recent related works in this section.

2.1 Neural Video Game Simulation

In the past few years, video game simulation using deep neural networks has emerged as a new research trend [Kim et al. 2020, 2021; Menapace et al. 2021, 2022; Davtyan and Favaro 2022; Huang et al. 2022]. The objective is to train a neural network to synthesize videos based on a specific type of prompt: a sequence of actions provided at every timestep.

This problem was first addressed using training videos annotated with the corresponding action labels at each timestep [Oh et al. 2015; Chiappa et al. 2017; Kim et al. 2020]. These works consider a discrete action representation that is difficult to define a priori for real-world environments. More recently, Kim et al. [2021] proposed a framework that uses a continuous action representation to model real-world driving scenarios. Devising a good continuous action representation for an environment, however, is complex. To avoid this complexity, Menapace et al. [2021, 2022] propose to learn a discrete action representation. Huang et al. [2022] expand on this idea by modeling actions as a learned set of geometric transformations, while Davtyan and Favaro [2022] represent actions by separating them into a global shift component and a local discrete action component.

Differently from our PGM, previous works perform generation in an autoregressive manner, conditioned on the actions and, therefore, are unable to answer prompts entailing constraint- or goal-driven generation for which non-sequential conditioning is necessary. We find the proposed text-based action representation and masked training procedure to be crucial to unlocking such applications.

Among these works, Playable Environments [Menapace et al. 2022] is the most closely related to ours. Rather than employing a 2D model, they use a NeRF-based renderer [Mildenhall et al. 2020] that enables them to represent complex 3D scenes. We follow this high-level design but introduce a more efficient plane- and voxel-based NeRF representation that enables rendering outputs at double the original resolution without the use of upsampling modules, which we found to cause checkerboard artifacts and failures in rendering small objects, and to be prone to failure when training at higher resolutions. In addition, their discrete action representation shows limitations in complex scenarios such as tennis, where it is only able to capture the main movement directions of the players and does not model actions such as ball hitting. In contrast, we employ a text action representation that specifies actions at a fine level of granularity (i.e., which particular ball-hitting action is being performed and where the ball is sent), while remaining interpretable and intuitive for the user. Lastly, we replace the adversarially trained LSTM animation module with a more capable masked diffusion transformer.

2.2 Game Engines

Game engines brought a revolution to game development by providing extensible and reusable software that can be employed to create a wide range of game models [Gregory 2018]. Nowadays, a range of game engines exists (Unity, Unreal, id Tech, Source, CRYENGINE, Frostbite, RAGE), and these have grown into vast software ecosystems. Modern game engines are organized into components including a rendering engine [Müller et al. 2020], a resource manager, a module for physics and collision, an animation manager, and, importantly, a gameplay foundation system that models the game rules and encapsulates game AI functionalities [Gregory 2018]. The presence of these components, coupled with the labor of a range of trained experts including software engineers, artists (animators, 3D modelers, texture and lighting artists), and game developers, enables the construction of sophisticated game models supporting low-level character control and scripted agent behavior. We show that monocular videos annotated with a fraction of the effort (see Supplement I.1) can be used to learn models of games that support answering challenging prompts related to agent intelligence, a capability difficult to achieve through scripted agent behavior.

2.3 Character Animation

Character animation is a long-standing problem in computer graphics. Several recent methods have been proposed that produce high-quality animations. Holden et al. [2020] propose a learnable version of Motion Matching [Büttner and Clavet 2015] that formulates character animation as retrieval of the closest motion from a motion database and supports interaction with other characters or objects. Other approaches model the evolution of characters using time series models conditioned on the preceding state and control signals [Starke et al. 2019, 2020; Lee et al. 2018; Holden et al. 2017]. Starke et al. [2019] propose a model based on a mixture of experts that controls character locomotion and object interactions; in a follow-up work [Starke et al. 2020], they introduce local motion phases to model complex character motions and interaction with a second character.

To produce high-quality animations, these methods rely on difficult-to-acquire motion capture data enriched with contact information [Holden et al. 2020; Starke et al. 2019, 2020], motion phases [Starke et al. 2019], or engineered action labels [Starke et al. 2019, 2020]. Additionally, handcrafted dataset-specific feature representations and mappings from user controls to such representations are often leveraged, and additional knowledge is injected through postprocessing steps such as inverse kinematics or external physics models. While these assumptions promote high-quality outputs, they come at a significant effort. In contrast, our method sidesteps these requirements by not using motion capture and basing user control on natural language that is cheaper to acquire (see Supplement I.1) and does not require manual engineering. Finally, character animation methods support limited goal-driven control such as interacting with a specific object while avoiding collisions [Starke et al. 2019]. In contrast, our method models complex game AI tasks such as modeling strategies to defeat the opponent, which are instrumental in answering complex user prompts.

2.4 Neural Rendering

Neural rendering was recently revolutionized by the advent of NeRF [Mildenhall et al. 2020]. Several modifications of the NeRF framework were proposed to model deformable objects [Tretschk et al. 2021; Park et al. 2021a; Weng et al. 2022; Park et al. 2021b; Li et al. 2022] and decomposed scene representations [Menapace et al. 2022; Niemeyer and Geiger 2021; Müller et al. 2022; Ost et al. 2021; Kundu et al. 2022]. In addition, several works improved the efficiency of the original MLP representation of the radiance field [Mildenhall et al. 2020] by employing octrees [Yu et al. 2021; Martel et al. 2021], voxel grids [Fridovich-Keil et al. 2022], triplanes [Chan et al. 2022], hash tables [Müller et al. 2022], or factorized representations [Chen et al. 2022].

Our framework is most related to that of Weng et al. [2022], since we model player deformations using an articulated 3D prior and linear blend skinning (LBS) [Lewis et al. 2000]. Differently from them, however, we consider scenes with multiple players and apply our method to articulated objects with varied structures for their kinematic trees. While similar to the rendering framework of Menapace et al. [2022], our framework does not adopt computationally inefficient MLP representations, using voxel [Fridovich-Keil et al. 2022] or plane representations instead, and thus does not rely on upsampler networks.

2.5 Sequential Data Generation with Diffusion Models

In prior work, sequential data generation was mainly addressed with auto-regressive formulations combined with adversarial [Kwon and Park 2019] or variational [Fortuin et al. 2020; Babaeizadeh et al. 2018] generative models. Recently, diffusion models have emerged as a promising solution to this problem, leading to impressive results in multiple applications such as audio [Kong et al. 2020; Lam et al. 2022; Chen et al. 2021; Leng et al. 2022] and video synthesis [Ho et al. 2022a, 2022b; Singer et al. 2022; Blattmann et al. 2023], language modeling [Dieleman et al. 2022], and human motion synthesis [Zhang et al. 2022; Dabral et al. 2023]. Following this methodological direction, Tashiro et al. [2021] introduce a score-based diffusion model for imputing missing values in time series. They introduce a training procedure based on masks that simulate missing data. This approach motivates our choice of a similar masking strategy to model the conditions entailed by the given prompt and generate the unknown environment states. In this work, we show that mask-based training is highly effective in modeling geometric properties together with textual data modalities.

2.6 Text-based Generation

In recent years, we have witnessed the emergence of works on the problem of text-based generation. Several works address the problem of generating images [Saharia et al. 2022; Rombach et al. 2021; Ramesh et al. 2021, 2022], videos with arbitrary content [Ho et al. 2022a, 2022b; Singer et al. 2022; Hong et al. 2022], and arbitrary 3D shapes [Jain et al. 2022; Lin et al. 2023; Achlioptas et al. 2023].

Han et al. [2022] introduced a video generation framework that can incorporate various conditioning modalities in addition to text, such as segmentation masks or partially occluded images. Their approach employs a frozen RoBERTa [Liu et al. 2020] language model and a sequence masking technique. Fu et al. [2023] propose an analogous framework. Our animation framework employs a similar masking strategy, but we model text conditioning at each timestep in the sequence, use diffusion models that operate on continuous rather than discrete data, and generate scenes that can be rendered from arbitrary viewpoints.

More relevant to our work, several papers introduced models to generate human motion sequences from text [Tevet et al. 2022; Athanasiou et al. 2022]. Recently, diffusion models have shown strong performance on this task [Zhang et al. 2022; Dabral et al. 2023]. In these works, sequences of human poses are generated by a diffusion model conditioned on the output of a frozen CLIP text encoder. It is worth noting that these prior works model only a single human, while our framework supports multiple human agents and objects and models their interactions with the environment.


3 METHOD

This section introduces our framework for the creation of Promptable Game Models that allows the user to perform a range of dynamic scene editing tasks, formulated as a set of conditioning prompts.

We divide our PGM into two modules: a synthesis model and an animation model. The synthesis model generates an image given the representation of the environment state. The animation model, instead, aims at modeling the game’s dynamics, with player actions and interactions, in the high-level space of the environment states. Actions are modeled as text, which is an expressive, yet intuitive form of control for a wide range of tasks. The overview of our framework is provided in Figure 2(a).

Fig. 2.

Fig. 2. (a) Overview of our framework. The animation model produces states \(\mathbf {s}\) based on user-provided conditioning signals, or prompts, \(\mathbf {s}^c, \mathbf {a}^c\), which are rendered by the synthesis model. (b) The diffusion-based animation model predicts the noise \(\boldsymbol {\epsilon }_k\) applied to the noisy states \(\mathbf {{s}}^p_k\), conditioned on known states \(\mathbf {s}^c\) and actions \(\mathbf {a}^c\) with the respective masks \(\mathbf {m}^\mathbf {s}, \mathbf {m}^\mathbf {a}\), the diffusion step k, and the framerate \(\nu\). The text encoder \(\mathcal {T}\) produces embeddings for the textual actions, while the temporal model \(\mathcal {A}\) performs noise prediction. (c) The synthesis model renders the current state using a composition of neural radiance fields, one for each object. A style encoder \(\mathcal {E}\) extracts the appearance \(\boldsymbol {\omega }\) of each object. Each object is represented in its canonical pose by \(\mathcal {C}\), and deformations of articulated objects are modeled by the deformation model \(\mathcal {D}\). After integration and composition, the feature grid \(\mathbf {G}\) is rendered to the final image using the feature enhancer \(\mathcal {F}\).

In more detail, our model defines the state of the entire environment as the combination of all individual object states. Consequently, each individual state is the set of the object properties, such as the position of each object in the scene, its appearance, or its pose. Formally, the environment state at time t can be represented by \(\mathbf {s}_t\in \mathcal {\mathbb {S}}=(\mathbb {R}^{n_1} \times \cdots \times \mathbb {R}^{n_{P}})\), i.e., P properties of variable lengths \(n_i\), defined as the union of the properties of all objects. This state representation captures all variable aspects of each object in the environment, thus it can be used by the synthesis model to generate the scene.
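
To make this layout concrete, the following minimal Python sketch (our own illustration, not code from the paper; the property names and sizes are assumptions) packs a set of named per-object properties into a flat state vector \(\mathbf {s}_t\) and recovers them again.

```python
import numpy as np

# Hypothetical property layout for a tennis scene: names and sizes are
# illustrative assumptions, not the exact layout used in the paper.
STATE_LAYOUT = {
    "player_bottom.position": 3,   # 3D root location
    "player_bottom.pose":     72,  # per-joint rotations (flattened)
    "player_top.position":    3,
    "player_top.pose":        72,
    "ball.position":          3,
    "ball.velocity":          3,
}

def pack_state(properties: dict) -> np.ndarray:
    """Concatenate the P per-object properties into a single state vector s_t."""
    return np.concatenate([np.asarray(properties[name], dtype=np.float32)
                           for name in STATE_LAYOUT])

def unpack_state(s_t: np.ndarray) -> dict:
    """Split a flat state vector back into named properties of length n_i."""
    out, offset = {}, 0
    for name, size in STATE_LAYOUT.items():
        out[name] = s_t[offset:offset + size]
        offset += size
    return out

# A full sequence of T states is then simply an array of shape (T, sum(n_i)).
```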

However, the animation model predicts the evolution of an environment in time, which is represented by the sequence of its states \(\lbrace \mathbf {s}_1,\mathbf {s}_2, \dots \mathbf {s}_{T}\rbrace =\mathbf {s}\in \mathcal {\mathbb {S}}^{T}\), where T is the length of the sequence. The model provides control over sequence generation with the help of user-defined conditioning signals, or prompts, that can take two forms: explicit state manipulation and high-level text-based editing. With respect to the former, the user could change the position of the tennis ball at timestep t, and the model would automatically adapt the position of the ball in other nearby states. As far as the latter is concerned, users could provide high-level textual actions such as “The player takes several steps to the right and hits the ball with a backhand” and the model would generate the corresponding sequence of states (see Figure 3). These generic actions in the form of text are central to enabling high-level, yet fine-grained control over the evolution of the environment. To train our framework, we assume a dataset of camera-calibrated videos, where each video frame is annotated with the corresponding states \(\mathbf {s}\) and actions \(\mathbf {a}\).

Fig. 3.

Fig. 3. Different sequences predicted on the Tennis and Minecraft datasets starting from the same initial state and altering the text conditioning. Our model moves players and designates shot targets using domain-specific referential language (e.g., “right service box,” “no man’s land,” “baseline”). The model supports fine-grained control over the various tennis shots using technical terms (e.g., “forehand,” “backhand,” “volley”).

3.1 Synthesis Model

In this section, we describe the synthesis model that renders states from controllable viewpoints (see Figure 2(c)). We build our model based on a compositional NeRF [Menapace et al. 2022] framework that enables explicit control over the camera and represents a scene as a composition of different, independent objects. Thanks to the independent representation of objects, each object property is directly linked to an aspect of the respective object and can thus be easily controlled and manipulated. The compositional NeRF framework allows different, specialized NeRF architectures to be used for each object based on its type. To further improve quality, rather than directly rendering RGB images with the NeRF models, we render features and make use of a feature enhancer CNN to produce the RGB output. To represent objects with different appearances, we condition the NeRF and enhancer models on the style codes extracted with a dedicated style encoder [Menapace et al. 2022]. Our model is trained using reconstruction as the main guiding signal.

In Sections 3.1.1–3.1.6, we illustrate the main components of the synthesis module, and in Section 3.1.7, we describe the training procedure.

3.1.1 Scene Composition with NeRFs.

Neural radiance fields represent a scene as a radiance field, a 5D function parametrized as a neural network mapping the current position \(\mathbf {x}\) and viewing direction \(\mathbf {d}\) to density \(\mathbf {\sigma }\) and radiance \(\mathbf {c}\).

To allow controllable generation of complex scenes, we adopt a compositional strategy where each object in the scene is modeled with a dedicated NeRF model [Menapace et al. 2022; Müller et al. 2022; Xu et al. 2022]. The scene is rendered by sampling points independently for each object and querying the respective object radiance field \(\mathcal {C}_i\). The results for all objects are then merged and sorted by distance from the camera before being integrated.
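
The sketch below illustrates this compositional rendering step for a single ray under simplified assumptions (pre-sampled per-object points and standard volume-rendering quadrature); it is an illustration of the merge-sort-integrate logic, not the paper’s implementation.

```python
import torch

def composite_ray(samples):
    """Merge per-object ray samples and integrate them front-to-back.

    `samples` is a list with one entry per object, each a tuple
    (t, sigma, feat) of tensors with shapes (N_i,), (N_i,), (N_i, F):
    distances along the ray, densities, and radiance/features returned by
    that object's radiance field C_i.
    """
    t = torch.cat([s[0] for s in samples])          # (N,)
    sigma = torch.cat([s[1] for s in samples])      # (N,)
    feat = torch.cat([s[2] for s in samples])       # (N, F)

    # Sort all samples from every object by distance from the camera.
    order = torch.argsort(t)
    t, sigma, feat = t[order], sigma[order], feat[order]

    # Standard volume rendering quadrature (Mildenhall et al. 2020).
    delta = torch.cat([t[1:] - t[:-1], torch.full_like(t[:1], 1e10)])
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1] + 1e-10]), dim=0)
    weights = alpha * trans                          # (N,)
    return (weights[:, None] * feat).sum(dim=0)      # (F,) composited feature
```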

All objects are assumed to be described by a set of properties whose structure depends on the type of object, e.g., a player, the ball, the background. We consider the following properties:

Object location. Each object is contained within an axis-aligned bounding box \(\mathbf {b}^\mathrm{3D}_i\), which is defined by size and position. In the case of the ball, we additionally consider its velocity to model blur effects (Section 3.1.6).

Object style. All objects have an appearance that may vary in different sequences, thus we introduce a style code \(\boldsymbol {\omega }_i\) as an additional property for all objects. Since it is difficult to define such style information a priori, we assume it to be a latent variable and learn it jointly during training.

Object pose. Articulated objects such as humans require additional properties to model varying poses. We model the deformation of articulated objects as a kinematic tree with \({J}_i\) joints and consider as object properties the rotation \(\mathbf {R}\) and translation \(\mathbf {tr}\) parameters associated with each joint (Section 3.1.4).

From now on, we drop the object index i to simplify notation.

3.1.2 Style Encoder.

Representing the appearance of each object is challenging, since it changes based on the type of object and illumination conditions. We treat the style \(\boldsymbol {\omega }\) for each object as a latent variable that we regress using a convolutional style encoder \(\mathcal {E}\). Given the current video frame \(\mathbf {I}\) with O objects, we compute 2D bounding boxes \({\mathbf {b}^\mathrm{2D}}\) for each object. First, a set of residual blocks is used to extract frame features that are later cropped around each object according to \({\mathbf {b}^\mathrm{2D}}\) using RoI pooling [Girshick et al. 2013]. Later, a series of convolutional layers with a final projection is used to predict the style code \(\boldsymbol {\omega }\) from the cropped feature maps.
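
A minimal sketch of such a style encoder is given below, using torchvision's roi_align for the RoI pooling step; the backbone depth, crop resolution, and style dimensionality are illustrative assumptions, and only the overall structure (shared frame features, per-object crops, per-object style codes) follows the description above.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class StyleEncoder(nn.Module):
    """Sketch of a per-object style encoder E (layer sizes are assumptions)."""

    def __init__(self, style_dim: int = 64):
        super().__init__()
        # Frame feature extractor (stand-in for the residual blocks in the paper).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Per-crop head predicting the style code omega.
        self.head = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, style_dim),
        )

    def forward(self, image: torch.Tensor, boxes_2d: torch.Tensor) -> torch.Tensor:
        # image: (1, 3, H, W); boxes_2d: (O, 4) boxes (x1, y1, x2, y2) in pixels.
        feats = self.backbone(image)                          # (1, 128, H/4, W/4)
        rois = torch.cat([torch.zeros(len(boxes_2d), 1), boxes_2d], dim=1)
        crops = roi_align(feats, rois, output_size=(8, 8),
                          spatial_scale=0.25, aligned=True)   # (O, 128, 8, 8)
        return self.head(crops)                               # (O, style_dim) codes omega

# Example: two objects in a 1024 x 576 frame.
encoder = StyleEncoder()
omega = encoder(torch.randn(1, 3, 576, 1024),
                torch.tensor([[100., 200., 180., 400.], [600., 150., 680., 350.]]))
```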

3.1.3 Volume Modeling for Efficient Sampling.

Radiance fields are commonly parametrized using MLPs [Mildenhall et al. 2020], but such a representation requires a separate MLP evaluation for each sampled point, making it computationally challenging to train high-resolution models. To overcome this issue, we model the radiance field \(\mathcal {C}\) of each object in a canonical space using two alternative parametrizations.

For three-dimensional objects, we make use of a voxel grid parametrization [Fridovich-Keil et al. 2022; Weng et al. 2022]. Starting from a fixed noise tensor \(\mathbf {V}^{\prime }\in \mathbb {R}^{F^{\prime }\times H^{\prime }_{V}\times W^{\prime }_{V}\times D^{\prime }_{V}}\), a series of 3D convolutions produces a voxel grid \(\mathbf {V}\in \mathbb {R}^{(F+1) \times H_{V}\times W_{V}\times D_{V}}\) containing the features and density associated with each point in the bounded space. Here, \(F^{\prime }\) and F represent the number of features, while \(H_{V}\), \(W_{V}\), and \(D_{V}\) represent the size of the voxel grid. Given a point in the object canonical space \(\mathbf {x}_c\), the associated features and density \(\mathbf {\sigma }\) are retrieved using trilinear sampling on \(\mathbf {V}\). To model the different appearance of each object, we adopt a small MLP conditioned on the style \(\boldsymbol {\omega }\) to produce a stylized feature with the help of weight demodulation [Karras et al. 2020].

For two-dimensional objects such as planar scene elements, we make use of a similar parametrization where a fixed 2D noise tensor \(\mathbf {P}^{\prime }\in \mathbb {R}^{F^{\prime }\times H^{\prime }_{P}\times W^{\prime }_{P}}\) is mapped to a plane of features \(\mathbf {P}\in \mathbb {R}^{F\times H_{P}\times W_{P}}\) using a series of 2D convolutions. Given a ray r, we compute the intersection point \(\mathbf {x}\) between the plane and the ray that is used to sample \(\mathbf {P}\) using bilinear sampling. Similarly to the voxel case, a small MLP is used to model object appearance according to \(\boldsymbol {\omega }\). We assume planes to be fully opaque and assign a fixed density value \(\mathbf {\sigma }\) to each sample. Thanks to this representation, a single point per ray is sufficient to render the object.
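
Both parametrizations reduce per-point evaluation to a grid lookup. The sketch below shows how such lookups could be implemented with PyTorch's grid_sample, assuming canonical coordinates already normalized to [-1, 1]; it illustrates only the sampling step and is not the authors’ code.

```python
import torch
import torch.nn.functional as F

def sample_voxel(voxel: torch.Tensor, x_c: torch.Tensor) -> torch.Tensor:
    """Trilinearly sample per-point features (and density) from a voxel grid.

    voxel: (F+1, D, H, W) feature/density grid produced by the 3D convolutions.
    x_c:   (N, 3) points in the object's canonical space, normalized to [-1, 1].
    Returns (N, F+1) sampled features; the last channel can be read as density.
    """
    grid = x_c.view(1, 1, 1, -1, 3)                    # (1, 1, 1, N, 3) sample locations
    vox = voxel.unsqueeze(0)                           # (1, F+1, D, H, W)
    out = F.grid_sample(vox, grid, mode="bilinear",    # trilinear for 5D inputs
                        align_corners=True)            # (1, F+1, 1, 1, N)
    return out.view(voxel.shape[0], -1).t()            # (N, F+1)

def sample_plane(plane: torch.Tensor, x_p: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample features at ray/plane intersection points.

    plane: (F, H, W) canonical feature plane; x_p: (N, 2) in-plane coordinates in [-1, 1].
    """
    grid = x_p.view(1, 1, -1, 2)                       # (1, 1, N, 2)
    out = F.grid_sample(plane.unsqueeze(0), grid,
                        mode="bilinear", align_corners=True)  # (1, F, 1, N)
    return out.view(plane.shape[0], -1).t()            # (N, F)
```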

3.1.4 Deformation Modeling.

Since the radiance field \(\mathcal {C}\) alone supports only rendering of rigid objects expressed in a canonical space, to render articulated objects such as humans, we introduce a deformation model \(\mathcal {D}\). Given an articulated object, we assume its kinematic tree is known and that the transformation \([\mathbf {R}_j|\mathbf {tr}_j]\) from each joint \(j \in \lbrace 1, \ldots ,J\rbrace\) to the parent joint is part of the object’s properties. We then implement a deformation procedure based on linear blend skinning (LBS) [Lewis et al. 2000] and inspired by HumanNeRF [Weng et al. 2022], which employs the joint transformations and a learned volume of blending weights \(\mathbf {W}\in \mathbb {R}^{(J+1) \times H_{W}\times W_{W}\times D_{W}}\) to associate each point in the bounding box of the articulated object to the corresponding one in the canonical volume. We present additional details in Supplement C.
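
As an illustration of the LBS warping step, the sketch below maps observation-space points to the canonical space given per-joint transformations and per-point blending weights; the tensor shapes and the direction of the joint transforms are assumptions consistent with the description above, not the exact procedure of the paper.

```python
import torch

def lbs_observation_to_canonical(x_o: torch.Tensor,
                                 R: torch.Tensor, tr: torch.Tensor,
                                 w: torch.Tensor) -> torch.Tensor:
    """Warp observation-space points to the canonical pose with linear blend skinning.

    x_o: (N, 3) sampled points in the deformed (observation) space.
    R:   (J, 3, 3) rotations and tr: (J, 3) translations mapping each joint's
         canonical frame to the observed pose (composed along the kinematic tree).
    w:   (N, J) blending weights per point, e.g. trilinearly sampled from the
         learned weight volume W and normalized to sum to one.
    Returns (N, 3) corresponding points in the canonical space.
    """
    # Invert each joint transform: x_c = R^T (x_o - tr), per joint.
    x_c_per_joint = torch.einsum("jab,nja->njb", R, (x_o[:, None, :] - tr[None]))
    # Blend the per-joint candidates with the skinning weights.
    return (w[:, :, None] * x_c_per_joint).sum(dim=1)
```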

3.1.5 Enhancer.

NeRF models are often parametrized to output radiance \(\mathbf {c}\in \mathbb {R}^3\) and directly produce an image. However, we find that such an approach struggles to produce correct shading of the objects, with details such as shadows being difficult to synthesize. Also, to improve the computational efficiency of the method, we sample a limited number of points per ray, which may introduce subtle artifacts in the geometry. To address these issues, we parametrize the model \(\mathcal {C}\) to output features where the first three channels represent radiance and the subsequent ones represent learnable features. Ray integration then produces a feature grid \(\mathbf {G}\in \mathbb {R}^{F\times H\times W}\) and an RGB image \(\mathbf {\tilde{I}}\in \mathbb {R}^{3 \times H\times W}\). We introduce an enhancer network \(\mathcal {F}\) modeled as a UNet [Ronneberger et al. 2015] architecture interleaved with weight demodulation layers [Karras et al. 2020] that maps \(\mathbf {G}\) and the style codes \(\boldsymbol {\omega }\) to the final RGB output \(\mathbf {\hat{I}}\in \mathbb {R}^{3 \times H\times W}\).
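
Weight demodulation [Karras et al. 2020] is what injects the style code into the enhancer (and into the canonical-space MLPs). A minimal demodulated convolution could look as follows; channel sizes are illustrative and this is a sketch of the mechanism, not the exact layer used in our enhancer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemodulatedConv2d(nn.Module):
    """Sketch of a style-modulated convolution with weight demodulation."""

    def __init__(self, in_ch, out_ch, style_dim, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.to_scale = nn.Linear(style_dim, in_ch)   # maps omega to per-channel scales
        self.padding = kernel_size // 2

    def forward(self, x, omega):
        b, in_ch, h, w = x.shape
        s = self.to_scale(omega).view(b, 1, in_ch, 1, 1)            # (B, 1, Cin, 1, 1)
        weight = self.weight[None] * s                               # modulate
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)
        weight = weight * demod[:, :, None, None, None]              # demodulate
        # Grouped convolution applies a different modulated kernel per batch element.
        weight = weight.view(b * weight.shape[1], in_ch, *weight.shape[3:])
        x = x.reshape(1, b * in_ch, h, w)
        out = F.conv2d(x, weight, padding=self.padding, groups=b)
        return out.view(b, -1, h, w)

# Example: process a feature grid G with a per-image style code omega.
layer = DemodulatedConv2d(in_ch=32, out_ch=32, style_dim=64)
G = torch.randn(2, 32, 72, 128)
omega = torch.randn(2, 64)
out = layer(G, omega)        # (2, 32, 72, 128)
```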

3.1.6 Object-specific Rendering.

Our compositional approach allows the use of object-specific techniques. In particular, in the case of tennis, we detail in Supplement B how we can apply dedicated procedures to enhance the rendering quality of the ball, the racket, and the 2D user interfaces such as the scoreboards. The rendering of the tennis ball is treated specially to render the blur that occurs in real videos in the case of fast-moving objects. The racket can be inserted in a post-processing stage to compensate for the difficulty of NeRFs to render thin, fast-moving objects. Finally, the UI elements are removed from the scene, since they do not behave in a 3D-consistent manner. For Minecraft, we describe how the scene skybox is modeled.

3.1.7 Training.

We train our model using reconstruction as the main driving signal. Given a frame \(\mathbf {I}\) and the reconstructed frame \(\mathbf {\hat{I}}\), we use a combination of an L2 reconstruction loss and the perceptual loss of Johnson et al. [2016] as our training loss. To minimize the alterations introduced by the enhancer and improve view consistency, we impose the same losses between \(\mathbf {I}\) and \(\mathbf {\tilde{I}}\), the output of the model before the feature enhancer. All losses are summed without weighting to produce the final loss term. To minimize GPU memory consumption, instead of rendering full images, we impose the losses on sampled image patches [Menapace et al. 2022].
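
A sketch of the resulting patch loss is shown below; `perceptual` stands for an implementation of the Johnson et al. [2016] feature reconstruction loss and is assumed to be provided.

```python
import torch
import torch.nn.functional as F

def synthesis_loss(I_patch, I_hat_patch, I_tilde_patch, perceptual):
    """Patch reconstruction loss for the synthesis model (sketch).

    I_patch: ground-truth patch; I_hat_patch: output of the enhancer;
    I_tilde_patch: pre-enhancer RGB output; `perceptual(a, b)` is assumed to
    implement the Johnson et al. (2016) feature reconstruction loss.
    All terms are summed without weighting, as described above.
    """
    loss = F.mse_loss(I_hat_patch, I_patch) + perceptual(I_hat_patch, I_patch)
    loss = loss + F.mse_loss(I_tilde_patch, I_patch) + perceptual(I_tilde_patch, I_patch)
    return loss
```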

We train all the components of the synthesis model jointly using Adam [Kingma and Ba 2015] for 300k steps with batch size 32. We set the learning rate to \(1e-4\) and exponentially decrease it to \(1e-5\) at the end of training. The framework is trained on videos with 1024 \(\times\) 576 px resolution. We present additional details in Supplement D.1 and in Supplement E.1 and discuss inference details in Supplement F.

3.2 Animation Model

In this section, we describe the animation model (see Figure 2(b)), whose task is that of generating sequences of states \(\mathbf {s}\in \mathcal {\mathbb {S}}^{T}\) according to user inputs. The animation model allows users to specify conditioning signals, or prompts, in two forms. First, conditioning signals can take the form of values that the user wants to impose on some object properties in the sequence, such as the player position at a certain timestep. This signal is represented by a sequence \(\mathbf {s}^c\in \mathcal {\mathbb {S}}^{T}\). This form of conditioning allows fine control over the generated sequence but requires directly specifying values of properties. Second, to allow high-level, yet granular control over the sequence, we introduce actions in the form of text \(\mathbf {a}^c\in {\mathbb {L}}^{A\times T}\) that specify the behavior of each of the A actionable objects at each timestep in the sequence, where \(\mathbb {L}\) is the set of all strings of text. To maximize the flexibility of the framework, we consider all values in \(\mathbf {s}^c\) and \(\mathbf {a}^c\) to be optional, thus we introduce their respective masks \(\mathbf {m}^\mathbf {s}\in \lbrace 0,1\rbrace ^{P\times T}\) and \(\mathbf {m}^\mathbf {a}\in \lbrace 0,1\rbrace ^{A\times T}\) that are set to 1 when the respective conditioning signal is present. We assume elements where the mask is not set to be equal to 0. The animation model predicts \(\mathbf {s}^p\in \mathcal {\mathbb {S}}^{T}\) conditioned on \(\mathbf {s}^c\) and \(\mathbf {a}^c\) such that: (1) \(\begin{equation} \mathbf {s}= \mathbf {s}^p+ \mathbf {s}^c, \end{equation}\) where we consider the entries in \(\mathbf {s}^p\) and \(\mathbf {s}^c\) to be mutually exclusive, i.e., an element of \(\mathbf {s}^p\) is 0 if the corresponding conditioning signal in \(\mathbf {s}^c\) is present according to \(\mathbf {m}^\mathbf {s}\). Note that the prediction of actions is not necessary, since \(\mathbf {s}\) is sufficient for rendering.
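
The sketch below illustrates, under assumed tensor shapes and with placeholder values, how such a prompt could be assembled: a fully specified initial state plus a textual action for one agent over part of the sequence, with everything else left for the model to generate.

```python
import torch

# Illustrative sizes: sequence length, flattened state dimension, actionable agents.
T, P_dim, A = 16, 12, 2

# Empty prompt: nothing is known, everything is to be generated.
s_c = torch.zeros(T, P_dim)                       # conditioning state values
m_s = torch.zeros(T, P_dim)                       # state mask
a_c = [["" for _ in range(T)] for _ in range(A)]  # textual actions per agent/timestep
m_a = torch.zeros(A, T)                           # action mask

# Constrain the entire first state (e.g., the observed starting configuration).
s_c[0] = torch.randn(P_dim)                       # stand-in for real property values
m_s[0] = 1.0

# Ask agent 0 to perform a high-level action over the first half of the sequence.
for t in range(T // 2):
    a_c[0][t] = "the player moves to the right and hits the ball with a backhand"
    m_a[0, t] = 1.0

# The animation model receives (s_c, a_c, m_s, m_a) and predicts s_p; the full
# sequence for the synthesis model is recovered as s = s_p + s_c (Equation 1).
```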

We adopt a temporal model \(\mathcal {A}\) based on a non-autoregressive masked transformer design and leverage the knowledge of a pretrained language model in a text encoder \(\mathcal {T}\) to model action conditioning information [Han et al. 2022]. The masked design provides support for the optional conditioning signals and is trained using masked sequence modeling, where we sample \(\mathbf {m}^\mathbf {s}\) and \(\mathbf {m}^\mathbf {a}\) according to various strategies that emulate desired inference tasks.

In Section 3.2.1, we define our text encoder, Section 3.2.2 defines the diffusion backbone, and in Section 3.2.3, we describe the training procedure.

3.2.1 Text Encoder.

We introduce a text encoder \(\mathcal {T}\) that encodes textual actions into a sequence of fixed-size text embeddings: (2) \(\begin{equation} \mathbf {a^\mathrm{emb}}= \mathcal {T}(\mathbf {a}^c) \in \mathbb {R}^{A\times T\times N_t}, \end{equation}\) where \(N_t\) is the size of the embedding for an individual sentence. Given a textual action, we leverage a pretrained T5 text model [Raffel et al. 2022] \(\mathcal {T}_\mathrm{enc}\) that tokenizes the sequence and produces an output feature for each token. Successively, a feature aggregator \(\mathcal {T}_\mathrm{agg}\), modeled as a transformer encoder [Vaswani et al. 2017], produces the aggregated text embedding from the text model features. To retain the knowledge already present in \(\mathcal {T}_\mathrm{enc}\), we keep it frozen and train only the feature aggregator \(\mathcal {T}_\mathrm{agg}\).
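
A possible implementation of this two-stage encoder is sketched below using the Hugging Face transformers library; the T5 variant (t5-small), the number of aggregator layers, and the use of a learned aggregation token are assumptions, since the paper does not fix these details here.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class TextEncoder(nn.Module):
    """Sketch of the text encoder T: a frozen T5 encoder followed by a small
    trainable transformer aggregator (model sizes are assumptions)."""

    def __init__(self, embed_dim: int = 512, t5_name: str = "t5-small"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(t5_name)
        self.t5 = T5EncoderModel.from_pretrained(t5_name)
        for p in self.t5.parameters():            # keep the language model frozen
            p.requires_grad_(False)
        d = self.t5.config.d_model
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))   # learned aggregation token
        self.proj = nn.Linear(d, embed_dim)

    @torch.no_grad()
    def _token_features(self, sentences):
        tokens = self.tokenizer(sentences, return_tensors="pt", padding=True)
        return self.t5(**tokens).last_hidden_state       # (B, L, d) per-token features

    def forward(self, sentences):
        feats = self._token_features(sentences)
        cls = self.cls.expand(feats.shape[0], -1, -1)
        agg = self.aggregator(torch.cat([cls, feats], dim=1))
        return self.proj(agg[:, 0])                      # (B, embed_dim), one vector per action

encoder = TextEncoder()
emb = encoder(["the player hits the ball with a backhand",
               "the player moves towards the net"])      # (2, 512)
```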

3.2.2 Temporal Modeling.

In this section, we introduce the temporal model \(\mathcal {A}\) that predicts the sequence of states \(\mathbf {s}\) conditioned on known state values \(\mathbf {s}^c\), action embeddings \(\mathbf {a^\mathrm{emb}}\), and the respective masks \(\mathbf {m}^\mathbf {s}\) and \(\mathbf {m}^\mathbf {a}\). Since only unknown state values need to be predicted, the model predicts \(\mathbf {s}^p\) and the complete sequence of states is obtained as \(\mathbf {s}= \mathbf {s}^p+ \mathbf {s}^c\), following Equation (1). Diffusion models have recently shown state-of-the-art performance on several tasks closely related to our setting such as sequence modeling [Tashiro et al. 2021] and text-conditioned human motion generation [Dabral et al. 2023; Zhang et al. 2022]. Thus, we follow the DDPM [Ho et al. 2020] diffusion framework, and we frame the prediction of \(\mathbf {s}^p=\mathbf {s}^p_0\) as a progressive denoising process \(\mathbf {s}^p_0, \ldots ,\mathbf {s}^p_K\), where we introduce the diffusion timestep index \(k\in \lbrace 0, \ldots ,K\rbrace\). The temporal model \(\mathcal {A}\) acts as a noise estimator that predicts the Gaussian noise \(\boldsymbol {\epsilon }_k\) in the noisy sequence of unknown states \(\mathbf {{s}}^p_k\) at diffusion timestep k: (3) \(\begin{equation} \boldsymbol {\epsilon }^p_k= \mathcal {A}\left(\mathbf {{s}}^p_k|\mathbf {s}^c,\mathbf {a^\mathrm{emb}},\mathbf {m}^\mathbf {s},\mathbf {m}^\mathbf {a},k\right)\!. \end{equation}\) An illustration of the proposed diffusion model is shown in Figure 2(b).

We realize \(\mathcal {A}\) using a transformer encoder [Vaswani et al. 2017]. To prepare the transformer’s input sequence, we employ linear projection layers \(\mathcal {P}\) with separate parameters for each object property. Since corresponding entries in \(\mathbf {{s}}^p_k\) and \(\mathbf {s}^c\) are mutually exclusive, we only consider the one that is present as input to the transformer, and we employ different projection parameters to enable the model to easily distinguish between the two. An analogous projection is performed for \(\mathbf {a^\mathrm{emb}}\) and, subsequently, the projection outputs for states and actions are concatenated into a single sequence \(\mathbf {e}\in \mathbb {R}^{(P+A)\times T\times E}\), which constitutes the input to the transformer. An output projection layer with separate weights for each object property produces the prediction \(\boldsymbol {\epsilon }^p_k\) at the original dimensionality. To condition the model on the diffusion timestep k, we introduce a weight demodulation layer [Karras et al. 2020] after each self-attention and feedforward block [Zhang et al. 2022].
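
The input assembly could be sketched as follows (shapes and module sizes are assumptions): each object property and each agent’s action embedding contributes one token per timestep, with separate projections for known and unknown state entries.

```python
import torch
import torch.nn as nn

class StateActionTokenizer(nn.Module):
    """Sketch of the input assembly for the temporal model A (sizes assumed)."""

    def __init__(self, prop_dims, num_agents, text_dim, embed_dim):
        super().__init__()
        # Separate projections for known (conditioning) and unknown (noisy) entries.
        self.proj_known = nn.ModuleList([nn.Linear(d, embed_dim) for d in prop_dims])
        self.proj_unknown = nn.ModuleList([nn.Linear(d, embed_dim) for d in prop_dims])
        self.proj_action = nn.ModuleList([nn.Linear(text_dim, embed_dim)
                                          for _ in range(num_agents)])

    def forward(self, s_k_props, s_c_props, m_s, a_emb):
        # s_k_props / s_c_props: lists of (T, d_i) noisy and conditioning property values.
        # m_s: (P, T) property mask; a_emb: (A, T, text_dim) action embeddings.
        tokens = []
        for i, (s_k, s_c) in enumerate(zip(s_k_props, s_c_props)):
            known = self.proj_known[i](s_c)
            unknown = self.proj_unknown[i](s_k)
            mask = m_s[i].unsqueeze(-1)                            # (T, 1)
            tokens.append(mask * known + (1 - mask) * unknown)     # keep the entry that is present
        for j in range(a_emb.shape[0]):
            tokens.append(self.proj_action[j](a_emb[j]))
        return torch.stack(tokens, dim=0)                          # ((P + A), T, E) sequence e
```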

To model long sequences while keeping reasonable computational complexity and preserving the ability to model long-term relationships between sequence elements, it is desirable to build the sequences using states sampled at a low framerate. However, this strategy would not allow the model to generate content at the original framerate and would prevent it from understanding dynamics such as limb movements that are clear only when observing sequences sampled at high framerates. To address this issue, we use the weight demodulation layers to further condition our model on the sampling framerate \(\nu\) to enable a progressive increase of the framerate at inference time (see Supplement F.2.1).
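
At inference time, the unknown states are obtained by running the standard DDPM reverse process conditioned on the prompt. The sketch below assumes a generic noise-estimator callable following the arguments of Equation (3) and tensors whose masks broadcast element-wise; it is a plain DDPM sampler, not the paper’s exact inference procedure (which additionally increases the framerate progressively, see Supplement F.2.1).

```python
import torch

@torch.no_grad()
def sample_states(model, s_c, a_emb, m_s, m_a, nu, K=1000):
    """Plain DDPM reverse process for the animation model (Ho et al. 2020).

    `model(s_p_k, s_c, a_emb, m_s, m_a, k, nu)` is assumed to return the
    predicted noise eps_k of Equation (3). s_c and m_s are tensors with the
    same shape as the state sequence.
    """
    betas = torch.linspace(1e-4, 0.02, K)                # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    s_p = torch.randn_like(s_c)                          # start from pure noise s_p_K
    for k in reversed(range(K)):
        eps = model(s_p, s_c, a_emb, m_s, m_a, k, nu)    # predicted noise
        mean = (s_p - betas[k] / torch.sqrt(1.0 - alpha_bar[k]) * eps) \
               / torch.sqrt(alphas[k])
        noise = torch.randn_like(s_p) if k > 0 else torch.zeros_like(s_p)
        s_p = mean + torch.sqrt(betas[k]) * noise        # sample s_p_{k-1}
    return s_p * (1.0 - m_s)                             # keep only the unknown entries
```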

3.2.3 Training.

To train our model, we sample a sequence \(\mathbf {s}\) with corresponding actions \(\mathbf {a}\) from a video in the dataset at a uniformly sampled framerate \(\nu\). Successively, we obtain masks \(\mathbf {m}^\mathbf {s}\) and \(\mathbf {m}^\mathbf {a}\) according to masking strategies we detail in Supplement E.2. The sequences for training are obtained following \(\mathbf {s}^p_0 = \mathbf {s}\odot (1-\mathbf {m}^\mathbf {s})\) and \(\mathbf {s}^c= \mathbf {s}\odot \mathbf {m}^\mathbf {s}\), and actions as \(\mathbf {a}^c= \mathbf {a}\odot \mathbf {m}^\mathbf {a}\), where \(\odot\) denotes the Hadamard product.

We train our model by minimizing the DDPM [Ho et al. 2020] training objective: (4) \(\begin{equation} \mathbb {E}_{k\sim \mathcal {U}(1,K),\epsilon \sim \mathcal {N}(0,I)}||\boldsymbol {\epsilon }^p_k- \boldsymbol {\epsilon }_k||, \end{equation}\) where \(\boldsymbol {\epsilon }^p_k\) is the noise estimated by the temporal model \(\mathcal {A}\) according to Equation (3). Note that the loss is not applied to positions in the sequence corresponding to conditioning signals [Tashiro et al. 2021].
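
A single training step under this objective could be sketched as follows; the model signature follows Equation (3), the masks are applied element-wise, and the mean-squared error is one standard instantiation of the norm in Equation (4).

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, s, a_emb, m_s, m_a, nu, alpha_bar):
    """One animation-model training step under Equation (4) (sketch).

    s: ground-truth state sequence; m_s, m_a: conditioning masks; alpha_bar:
    cumulative products of (1 - beta_k) for the chosen noise schedule.
    """
    s_c = s * m_s                               # known (conditioning) entries
    s_p0 = s * (1.0 - m_s)                      # entries the model must generate
    k = torch.randint(0, alpha_bar.shape[0], (1,)).item()

    eps = torch.randn_like(s_p0)
    # Forward diffusion applied to the unknown part of the sequence only.
    s_pk = torch.sqrt(alpha_bar[k]) * s_p0 + torch.sqrt(1.0 - alpha_bar[k]) * eps

    eps_pred = model(s_pk, s_c, a_emb, m_s, m_a, k, nu)
    # The loss is not applied at conditioning positions (Tashiro et al. 2021).
    return F.mse_loss(eps_pred * (1.0 - m_s), eps * (1.0 - m_s))
```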

Our model is trained using the Adam [Kingma and Ba 2015] optimizer with a learning rate of \(1e-4\), cosine schedule, and with 10k warmup steps. We train the model for a total of 2.5M steps and a batch size of 32. We set the length of the training sequences to \(T=16\). The number of diffusion timesteps is set to \(K=1,\!000\) and we adopt a linear noise schedule [Ho et al. 2020]. Additional details are presented in Supplement D.2 and Supplement F.


4 APPLICATIONS

Our framework enables a series of applications that are unlocked by its expressive state representation, the possibility to render it using a 3D-aware synthesis model, and the ability to generate sequences of states with an animation model that understands the game dynamics and can be conditioned on a wide range of signals. In the following, we demonstrate a set of selected applications.

Our state representation is modular, with style as one of its components. Style swapping replaces the style code \(\boldsymbol {\omega }\) of the desired object in the original image with one extracted from a target image. Similarly to a traditional game engine, our synthesis model renders the current state of the environment from a user-defined perspective. This enables our model to perform novel view synthesis. We show examples of both these capabilities in Supplement G.

We now show a set of applications enabled by the animation model. In Figure 3, we show results for generating different sequences using textual actions starting from a common initial state. Thanks to the textual action representation, it is possible to gain fine control over the generated results and to make use of referential language.

Our animation model, however, is not limited to generating sequences from step-by-step actions. Thanks to its understanding of the game’s dynamics, the model can tackle more complex tasks such as modeling an opponent against which a user-controlled player can play (see Figure 4), or even controlling all players without user intervention (see Figure 5) in a way similar to a “game AI.”

Fig. 4.

Fig. 4. Sequences generated by specifying actions for one of the players and letting the model act as the game AI and take control of the opponent. The game AI successfully responds to the actions of the player by running to the right (see top sequence) or towards the net (see bottom sequence), following two challenging shots of the user-controlled player.

Fig. 5.

Fig. 5. Sequences generated without any user conditioning signal. The actions of all players are controlled by the model that acts as the game AI. In tennis, the players produce a realistic exchange, with the bottom player advancing aggressively toward the net and the top player defeating him with a shot along the right sideline. The Minecraft player and tennis ball trajectories are highlighted for better visualization.

The animation model also unlocks the “director’s mode,” where the user can generate sequences by specifying prompts consisting of a desired set of high-level constraints or goals. The model is able to reason about actions to find a solution satisfying the given constraints. As a first example, Figure 6 demonstrates results for a navigation problem, where the user specifies a desired initial and final player position in the scene, and the model devises a path between them. Notably, the user can also constrain the solution on intermediate waypoints by means of natural language. As a second example, Figure 7 shows that the model is capable of devising strategies to defeat an opponent. Given an original sequence where the player commits a mistake and loses, the model can devise which actions the player should have taken to win. Notably, these model capabilities are learned by just observing sequences annotated with textual actions.

Fig. 6.

Fig. 6. Given an initial and a final state, we generate all the states in between. We repeat the generation multiple times, conditioning it using different actions indicating the desired intermediate waypoints.

Fig. 7.

Fig. 7. Given a sequence where the bottom player loses (see top), we ask the model to modify it such that the bottom player wins instead (see bottom). To do so, we condition the top player on the action “The player does not catch the ball.” While in the original sequence the bottom player aims its response to the center of the field where the opponent is waiting, the model now successfully generates a winning set of moves for the bottom player that sends the ball along the left sideline, too far for the top player to reach.


5 EVALUATION

In this section, we introduce our Tennis and Minecraft datasets (Section 5.1), describe our experimental protocol (Section 5.2), and perform evaluation of both the synthesis model (Section 5.3) and the animation model (Section 5.4). Additional evaluation results are shown in Supplement H.

5.1 Datasets

We collect two datasets to evaluate our method. Both datasets and the employed data collection tools are publicly available. In the following, we describe their structure and the available annotations.

5.1.1 Tennis Dataset.

We collect a dataset of broadcast tennis matches starting from the videos in Menapace et al. [2022]. The dataset depicts matches between two professional players from major tennis tournaments, captured with a single, static bird’s eye camera.

To enable the construction of PGMs, we collect a wide range of annotations with a combination of manual and automatic methods (see Supplement A.1):

For each frame, we perform camera calibration.

For each of the two players, we perform tracking and collect full SMPL [Loper et al. 2015] body parameters. Note that, in our work, we only use a subset of the parameters: rotation and translation associated with each joint and the location of the root joint in the scene.

For each player and frame, we manually annotate textual descriptions of the action being performed. We structure captions so each includes information on where and how the player is moving, the particular type of tennis shot being performed, and the location where the shot is aimed (see Supplement A.4). Captions make use of technical terms to describe shot types and field locations. In contrast to other video-text datasets that contain a single video-level description [Bain et al. 2021] or high-level action descriptions weakly aligned with the video content [Miech et al. 2019], the captions in our dataset are separate for each object and constitute a fine-grained description of the actions taking place in the frame.

For the ball, we perform 3D tracking and provide its position in the scene and its velocity vector indicating the speed and direction of movement.

We collect 7,112 video sequences in 1920 \(\times\) 1080 px resolution and 25 fps starting from the videos in Menapace et al. [2022] for a total duration of 15.5 h. The dataset features 1.12M fully annotated frames and 25.5k unique captions with 915 unique words. We highlight key statistics of the dataset and show samples in Supplement A.

We note that broadcast Tennis videos are monocular and do not feature camera movements other than rotation, thus the dataset does not make it possible to recover the 3D geometry of static objects [Menapace et al. 2022].

5.1.2 Minecraft Dataset.

We collect a synthetic dataset from the Minecraft video game. This dataset depicts a player performing a series of complex movements in a static Minecraft world that includes walking, sprinting, jumping, and climbing on various world structures such as platforms, pillars, stairs, and ladders. A single, monocular camera that slowly orbits around the scene center is used to capture the scenes. We collect a range of synthetic annotations using a game add-on we develop starting from ReplayMod [2022]:

Camera calibration for each frame.

Player rotation and translation parameters associated with each joint in the Minecraft kinematic tree format and the location of the root joint in the scene (see Supplement A.2).

A synthetically generated text caption describing the action being performed by the player. We assign varied, descriptive names to each element of the scene and build captions that describe scene elements or directions towards which the player is moving. Additionally, our captions capture how movement is happening, i.e., by jumping, sprinting, walking, climbing, or falling. We adopt a stochastic caption generation procedure that generates multiple alternative captions for each frame.

A total of 61 videos are collected in 1024 \(\times\) 576 px resolution and 20 fps for a total duration of 1.21 h. The dataset contains 68.5k fully annotated frames and 1.24k unique captions with 117 unique words. We highlight key statistics for the dataset in Supplement A.

5.2 Evaluation Protocol

We evaluate the synthesis and the animation models separately, following a similar evaluation protocol. We divide the test dataset into non-overlapping sequences of 16 frames sampled at 5 fps and 4 fps, respectively, for the Minecraft and Tennis datasets and make use of the synthesis or animation model to reconstruct them. In the case of the synthesis model, we directly reconstruct the video frames and compute the following metrics:

LPIPS [Zhang et al. 2018] is a standard metric for evaluating the reconstruction quality of the generated images.

FID [Heusel et al. 2017] is a widely used metric for image generation quality.

FVD [Unterthiner et al. 2018] is a standard metric for assessing the quality of generated videos.

Average Detection Distance (ADD) [Menapace et al. 2021] measures the average distance in pixels between the bounding box centers of ground truth bounding boxes and bounding boxes obtained from the generated sequences through a pretrained detector [Ren et al. 2015].

Missing Detection Rate (MDR) [Menapace et al. 2021] estimates the rate of bounding boxes that are present in the ground truth but missing in the generated videos.

For the animation model, we evaluate reconstruction of the object properties. Note that different strategies for masking affect the behavior of the model and the nature of the reconstruction task, thus we separately evaluate different masking configurations corresponding to different inference tasks. We compute metrics that address both the fidelity of the reconstruction and the realism of the produced sequences:

L2 computes the fidelity of the reconstruction by measuring the distance between the ground truth and reconstructed object properties along the sequence.

Fréchet Distance (FD) [Fréchet 1957] measures the realism of each object property by computing the Fréchet Distance between the distribution of real sequences of a certain object property and of generated ones.
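
The paper does not spell out the FD computation here; the sketch below shows the standard Gaussian Fréchet distance (as used, e.g., for FID) between feature matrices of real and generated property sequences, which is one plausible instantiation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real: np.ndarray, generated: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of real and generated samples.

    real, generated: arrays of shape (num_samples, feature_dim), e.g. flattened
    sequences of one object property.
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```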

We select different reconstruction tasks for evaluation:

Video prediction conditioned on actions consists in reconstructing the complete sequence starting from the initial state while the actions are specified for all timesteps. This setting corresponds to the evaluation setting of Menapace et al. [2022].

Unconditioned video prediction consists in reconstructing the complete sequence starting from the first state only.

Opponent modeling consists in reconstructing the object properties of an unknown player, based on the state of the other player, with actions specified only on the known player. Good performance in this task indicates the ability to model an opponent against which a user can play.

Sequence completion consists in reconstructing a sequence where eight consecutive states are missing. No actions are specified for the missing states. Good performance in this task indicates the ability to reason about how to reach a given goal state starting from the current one.

5.3 Synthesis Model Evaluation

In this section, we evaluate the performance of the synthesis model.

5.3.1 Comparison to Baselines.

We evaluate our method against Playable Environments (PE) [Menapace et al. 2022], the work most related to ours in that it builds a controllable 3D environment representation that is rendered with a compositional NeRF model where the position of each object is given and pose parameters are treated as a latent variable. Since the original method supports only outputs at 512 \(\times\) 288 px resolution, we produce baselines trained at both 512 \(\times\) 288 px and 1024 \(\times\) 576 px resolution, which we name PE and PE+, respectively. For a fair comparison, we also equip the baselines with our mechanism for representing ball blur and train a variant of our model using the same amount of computational resources as the baselines (Ours Small).

Results of the comparison are shown in Table 1, while qualitative results are shown in Figure 8. Our method scores best in terms of LPIPS, ADD, and MDR. Compared to PE+, our method produces significantly better FID and FVD scores. As shown in Figure 8, PE and PE+ produce checkerboard artifacts that are particularly noticeable on static scene elements such as judge stands, while our method produces sharp details. We attribute this difference to our ray sampling scheme and feature enhancer design that, in contrast to PE, do not sample rays at low resolution and perform upsampling, but rather directly operate on high resolution. In addition, thanks to our deformation and canonical space modeling strategies and higher resolution, our method produces more detailed players with respect to PE, where they frequently appear with missing limbs and blurred clothing. Finally, our model produces a realistic ball, while PE struggles to correctly model small objects, presumably due to its upsampling strategy, which causes rays to be sampled more sparsely so that they do not intersect the ball frequently enough to correctly render its blur effect.

Table 1. Comparison with Baselines and Ablation of the Synthesis Model

Tennis | LPIPS\(\downarrow\) | FID\(\downarrow\) | FVD\(\downarrow\) | ADD\(\downarrow\) | MDR\(\downarrow\)
PE† [Menapace et al. 2022] | 0.188 | 11.5 | 349 | 3.74 | 0.200
PE+ [Menapace et al. 2022] | 0.232 | 40.4 | 2432 | 132.3 | 49.7
w/o enhancer \(\mathcal {F}\) | 0.167 | 15.6 | 570 | 3.02 | 0.0728
w/o explicit deformation in \(\mathcal {D}\) | 0.156 | 13.3 | 524 | 3.10 | 0.0587
w/o planes in \(\mathcal {C}\) | 0.241 | 30.4 | 1064 | 2.94 | 0.0611
w/o voxels in \(\mathcal {C}\) | 0.170 | 17.1 | 757 | 3.03 | 0.0399
w/o our encoder \(\mathcal {E}\) | 0.174 | 15.0 | 600 | 3.18 | 0.0564
Ours Small | 0.156 | 13.4 | 523 | 2.88 | 0.0470
Ours | 0.152 | 12.8 | 516 | 2.88 | 0.0423

Minecraft | LPIPS\(\downarrow\) | FID\(\downarrow\) | FVD\(\downarrow\) | ADD\(\downarrow\) | MDR\(\downarrow\)
PE† [Menapace et al. 2022] | 0.0235 | 13.9 | 21.5 | 5.77 | 0.0412
PE+ [Menapace et al. 2022] | 0.0238 | 15.5 | 51.7 | 120.6 | 0.939
Ours Small | 0.00996 | 3.56 | 8.83 | 2.02 | 0.0529
Ours | 0.00814 | 2.81 | 7.08 | 1.98 | 0.0508

• MDR in %, ADD in pixels. Note that FID and FVD are computed on images downscaled to the feature extractor training resolution, thus blurriness in the PE baseline caused by its reduced resolution is not captured by these metrics. LPIPS correctly reflects lack of sharpness in the PE results (see Figure 8). † denotes output in 512 \(\times\) 288 px rather than 1024 \(\times\) 576 px resolution.

Fig. 8.

Fig. 8. Synthesis model qualitative results on the Tennis dataset. Compared to PE [Menapace et al. 2022], our model generates sharper players and static scene elements. Our ablation study shows corruption of the player geometry when voxels or our deformation model are not used. When removing our canonical plane representation, static scene elements appear blurry. When our feature enhancer is removed, the model does not generate shadows and players lose quality.

5.3.2 Ablation.

To validate our design choices, we produce several variations of our method, each obtained by removing one of our proposed architectural elements: We remove the enhancer \(\mathcal {F}\) and directly consider \(\mathbf {\tilde{I}}\) as our output; we remove the explicit deformation modeling procedure in \(\mathcal {D}\) of Section 3.1.4 and substitute it with an MLP that directly predicts the deformation from a learnable pose code, as in Menapace et al. [2022] and Tretschk et al. [2021]; we remove the plane-based canonical volume representation in \(\mathcal {C}\) for planar objects and use an MLP instead; we remove the voxel-based volume representation in \(\mathcal {C}\) and use an MLP instead; and we substitute our style encoder \(\mathcal {E}\) with an ad hoc encoder for each object in the scene, following Menapace et al. [2022].

We perform the ablation on the Tennis dataset and show results in Table 1 and Figure 8. To reduce computation, we train the ablation models using the same hyperparameters as the “Ours Small” model.

When removing the enhancer \(\mathcal {F}\), our model produces players with fewer details and does not generate shadow effects below players (see first row in Figure 8). When our deformation modeling procedure is not employed, the method produces comparable LPIPS, FID, and FVD scores, but an analysis of the qualitative results shows that players may appear with corrupted limbs (see last row in Figure 8). In addition, such a learned pose representation would reduce the controllability of the synthesis model compared to an explicit kinematic tree. When plane-based or voxel-based canonical modeling is removed, we notice artifacts in the static scene elements, such as corrupted logos, and in the players, such as detached or doubled limbs. Finally, when we replace our style encoder design with the one of Menapace et al. [2022], we notice fewer details in scene elements.

5.4 Animation Model Evaluation

In this section, we evaluate the performance of the animation model.

5.4.1 Comparison to Baselines.

Similarly to the synthesis model, we compare our animation model against the one of Playable Environments (PE) [Menapace et al. 2022], the most closely related work, since it operates on a similar environment representation. While the baseline jointly learns discrete actions and generates sequences conditioned on them, our task assumes text action representations to be available, so for fairness of evaluation we equip the baseline with our text encoder \(\mathcal {T}\) so it can make use of the action information. To reduce computation, we perform the comparison using half of the computational resources and a reduced training schedule; consequently, we also retrain our model, producing a reduced variant (Ours Small). To render results, we always make use of our synthesis model.

We show results averaged over all inference tasks in Table 2 and report per-task results in Supplement H.2. Our method outperforms the baseline in all evaluation tasks according to both the L2 and FD metrics. From the qualitative results in Figure 9, and in accordance with the FD metrics, we notice that our method produces more realistic player poses than PE, which tends to keep player poses close to the average pose and to slide the players across the scene. We attribute this difference to the use of the diffusion framework in our method. Consider the example of generating a player walking forward: It is equally probable that the player moves the left or the right leg first. With a reconstruction-based training objective, such as the main objective of PE, the model is encouraged to predict the average of the two motions, which amounts to not moving the legs at all. Diffusion models, in contrast, learn the multimodal distribution of the motion and can sample one of the possible motions rather than averaging them.
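
This intuition can be made precise with a short, standard argument (our illustration, not a derivation specific to PE): the minimizer of an L2 reconstruction loss is the conditional mean, which collapses a symmetric bimodal motion distribution onto an uninformative average, whereas a diffusion model approximates the full conditional distribution and can sample either mode:
\[
f^{\star} = \arg\min_{f} \; \mathbb{E}_{x,y}\!\left[\lVert y - f(x)\rVert^{2}\right] = \mathbb{E}\left[\,y \mid x\,\right].
\]
If the next pose moves either the left leg or the right leg by \(+\delta\) with equal probability, the conditional mean moves both legs by \(\delta/2\), i.e., the legs barely move, whereas sampling from the learned conditional distribution commits to one of the two realistic motions.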

Table 2. Animation Model Comparison with Baselines and Ablation with Results Averaged over All Inference Tasks

Tennis | Position L2\(\downarrow\) | Position FD\(\downarrow\) | Root angle L2\(\downarrow\) | Root angle FD\(\downarrow\) | Joints 3D L2\(\downarrow\) | Joints 3D FD\(\downarrow\)
PE | 3.291 | 229.112 | 1.126 | 15.953 | 0.303 | 53.242
Rec. LSTM | 1.597 | 7.253 | 0.907 | 7.051 | 0.193 | 16.735
Rec. Transf. | 1.074 | 4.402 | 0.767 | 6.838 | 0.175 | 14.845
Ours Small | 1.380 | 1.443 | 1.014 | 0.560 | 0.148 | 1.253
Ours | 1.099 | 0.929 | 0.844 | 0.356 | 0.129 | 0.836

Minecraft | Position L2\(\downarrow\) | Position FD\(\downarrow\) | Root angle L2\(\downarrow\) | Root angle FD\(\downarrow\) | Joints 3D L2\(\downarrow\) | Joints 3D FD\(\downarrow\)
PE | 2.739 | 105.973 | 1.620 | 31.232 | 0.311 | 39.572
Rec. LSTM | 2.292 | 47.296 | 1.702 | 49.971 | 0.489 | 99.843
Rec. Transf. | 2.154 | 53.198 | 1.430 | 36.123 | 0.385 | 69.977
Ours Small | 1.084 | 4.461 | 1.077 | 6.016 | 0.140 | 3.590
Ours | 1.065 | 4.815 | 0.956 | 4.083 | 0.132 | 3.360

• Position and Joints 3D in meters, Root angle in axis-angle representation.

Fig. 9.

Fig. 9. Qualitative results on the Tennis dataset. Sequences are produced in a video prediction setting that uses the first frame object properties and all actions as conditioning. The location of players is consistently closer to the ground truth for our method. Our method captures the multimodal distribution of player poses and generates vivid limb movements, while the baselines produce poses as the average of the distribution, resulting in reduced limb movement and tilted root joints. Additional samples are shown in Supplement H.2.

5.4.2 Ablation.

To validate this hypothesis and demonstrate the benefits of our diffusion formulation, we produce two variations of our method. The first substitutes the diffusion framework with a reconstruction objective, keeping the transformer-based architecture unaltered. The second, in addition to using the reconstruction objective, models \(\mathcal {A}\) with an LSTM, similarly to the PE baseline; unlike the PE baseline, however, this variant does not use adversarial training and employs a single LSTM for all objects rather than a separate model for each.
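
For clarity, the sketch below contrasts the two training objectives in simplified form; the model signatures, the toy noise schedule, and the tensor shapes are our illustrative assumptions and omit the masking and text conditioning of the full method.

```python
import torch
import torch.nn.functional as F

def reconstruction_objective(model, x0, cond):
    """Ablation variants: the network directly regresses the target object
    properties x0 from the conditioning signal (known states, actions)."""
    pred = model(cond)
    return F.mse_loss(pred, x0)

def diffusion_objective(model, x0, cond, num_steps=1000):
    """DDPM-style epsilon prediction: corrupt x0 with Gaussian noise at a random
    step t and train the network to recover the noise, given the same conditioning."""
    t = torch.randint(0, num_steps, (x0.shape[0],), device=x0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2  # toy cosine schedule
    alpha_bar = alpha_bar.view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    pred_noise = model(x_t, t, cond)
    return F.mse_loss(pred_noise, noise)
```

Under the reconstruction objective the network is pushed towards the conditional mean of the object properties, whereas the diffusion objective trains it to invert a noising process, so sampling at inference time can commit to a single plausible motion.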

We show results in Table 2. Our model consistently outperforms the baselines in terms of FD, showing a better ability to generate realistic sequences. Consistent with our assessment in Section 5.4.1, Figure 9 shows that our method trained with a reconstruction objective produces player movements with noticeable artifacts, analogous to those of PE, validating the choice of the diffusion framework.

5.5 Limitations

Since the model is trained on a dataset showing only plausible actions, its behavior is not defined when an implausible action is specified, such as hitting a ball while moving in the wrong direction to intercept it or jumping onto a pillar that is out of reach. In these cases, we find that the model either ignores the implausible part of the command and produces the closest plausible outcome or, less frequently, produces implausible outcomes such as unrealistically long jumps (see Figure 10). In addition, the model does not generate actions that are far out of distribution, such as performing a backflip or doing a push-up. This aspect could be addressed by jointly training the animation model on multiple diverse datasets, which we consider an interesting future direction.

Fig. 10.

Fig. 10. Behavior of the model when implausible actions are provided. In the left example, the model generates an unrealistically long jump to reach the specified pillar. In the right example, the bottom player is instructed to move left to intercept a ball coming to his right. In this case, the model ignores the left-movement command and produces the closest plausible outcome.

While our Tennis dataset contains varied text annotations that allow the model to generalize to text inputs with varied structure, the synthetic text annotations of our Minecraft dataset are less varied, and their fixed sentence structure tends to be memorized, making the model less effective when a different syntax is used (see Supplement H.1). To address this issue, a more sophisticated procedure could be employed to generate action annotations for the Minecraft dataset.

Our model learns to associate referential language with scene coordinates rather than with the appearance of the referred object, and it memorizes the positions of contact surfaces. While tennis scenes always share the same structure, this prevents the model from generalizing to different Minecraft scenes. This concern can be addressed by conditioning the animation model on the scene's geometry, which we leave as future work.

We find that our animation model overfits the Tennis dataset when less than 60% of the training data is used (see Supplement H.4). We leave the investigation of regularization techniques, such as dropout or weight decay, that could reduce overfitting in this scenario as an interesting avenue for future work.

Our animation model outperforms baselines that operate under the same data assumptions [Menapace et al. 2022] in terms of animation quality. Compared to recent character animation methods [Starke et al. 2019, 2020; Holden et al. 2020] that rely on richly annotated motion capture data and dataset-specific handcrafted optimizations (see Section 2.3), our method demonstrates more advanced game dynamics and game AI modeling capabilities but produces foot-sliding artifacts. We expect continued improvements in diffusion models to alleviate such artifacts, and further gains from pose parametrizations that take the distance of the limbs from the terrain into account, which we will explore in future work.

Last, our animation model does not yet produce results in real time. We discuss inference speed and strategies to make the model real-time in Supplement F.1. Improving the sampling speed of diffusion models is an actively investigated problem [Salimans and Ho 2022; Meng et al. 2022; Song et al. 2021] that is orthogonal to ours.


6 CONCLUSIONS

In this article, we demonstrate the feasibility of learning game models able to answer challenging user prompts and show that textual action representations are critical for unlocking fine-grained control over the generation process and enabling compelling constraint- and goal-driven generation applications. These results, jointly with two richly annotated text-video datasets, pave the way towards learning game models for complex, real-world scenes.


ACKNOWLEDGEMENTS

We would like to thank Christian Theobalt for his feedback on the manuscript draft, Denys Poluyanov, Eugene Shevchuk, and Oleksandr Pyshchenko for the useful discussion and validation of the use cases of PGMs, Maryna Diakonova for her support in data labeling, and Anton Kuzmenko and Vadym Hrebennyk for their assistance in creating the accompanying video.


REFERENCES

1. Achlioptas Panos, Huang Ian, Sung Minhyuk, Tulyakov Sergey, and Guibas Leonidas. 2023. ChangeIt3D: Language-assisted 3D shape edits and deformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'23).
2. Athanasiou Nikos, Petrovich Mathis, Black Michael J., and Varol Gül. 2022. TEACH: Temporal action compositions for 3D humans. In Proceedings of the International Conference on 3D Vision (3DV'22).
3. Babaeizadeh Mohammad, Finn Chelsea, Erhan Dumitru, Campbell Roy H., and Levine Sergey. 2018. Stochastic variational video prediction. In Proceedings of the International Conference on Learning Representations (ICLR'18).
4. Bain Max, Nagrani Arsha, Varol Gül, and Zisserman Andrew. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'21).
5. Blattmann Andreas, Rombach Robin, Ling Huan, Dockhorn Tim, Kim Seung Wook, Fidler Sanja, and Kreis Karsten. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'23).
6. Büttner Michael and Clavet Simon. 2015. Motion matching—The road to next gen animation. In Proceedings of Nucl.AI.
7. Chan Eric R., Lin Connor Z., Chan Matthew A., Nagano Koki, Pan Boxiao, Mello Shalini De, Gallo Orazio, Guibas Leonidas, Tremblay Jonathan, Khamis Sameh, Karras Tero, and Wetzstein Gordon. 2022. Efficient geometry-aware 3D generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
8. Chen Anpei, Xu Zexiang, Geiger Andreas, Yu Jingyi, and Su Hao. 2022. TensoRF: Tensorial radiance fields. In Proceedings of the European Conference on Computer Vision (ECCV'22).
9. Chen Nanxin, Zhang Yu, Zen Heiga, Weiss Ron J., Norouzi Mohammad, and Chan William. 2021. WaveGrad: Estimating gradients for waveform generation. In Proceedings of the International Conference on Learning Representations (ICLR'21).
10. Chiappa Silvia, Racanière Sébastien, Wierstra Daan, and Mohamed Shakir. 2017. Recurrent environment simulators. arXiv (2017).
11. Curtis Cassidy, Adalgeirsson Sigurdur Orn, Ciurdar Horia Stefan, McDermott Peter, Velásquez J. D., Knox W. Bradley, Martinez Alonso, Gaztelumendi Dei, Goussies Norberto Adrian, Liu Tianyu, and Nandy Palash. 2022. Toward believable acting for autonomous animated characters. In Proceedings of the 15th ACM SIGGRAPH Conference on Motion, Interaction and Games.
12. Dabral Rishabh, Mughal Muhammad Hamza, Golyanik Vladislav, and Theobalt Christian. 2023. MoFusion: A framework for denoising-diffusion-based motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'23).
13. Davtyan Aram and Favaro Paolo. 2022. Controllable video generation through global and local motion dynamics. In Proceedings of the European Conference of Computer Vision (ECCV'22).
14. Dieleman Sander, Sartran Laurent, Roshannai Arman, Savinov Nikolay, Ganin Yaroslav, Richemond Pierre H., Doucet Arnaud, Strudel Robin, Dyer Chris, Durkan Conor, Hawthorne Curtis, Leblond Rémi, Grathwohl Will, and Adler Jonas. 2022. Continuous diffusion for categorical data. arXiv (2022).
15. Fortuin Vincent, Baranchuk Dmitry, Rätsch Gunnar, and Mandt Stephan. 2020. GP-VAE: Deep probabilistic time series imputation. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR.
16. Fréchet Maurice. 1957. Sur la distance de deux lois de probabilité. Comptes Rendus Hebdom. Seances Acad. Sci. 244, 6 (1957), 689–692.
17. Fridovich-Keil Sara, Yu Alex, Tancik Matthew, Chen Qinhong, Recht Benjamin, and Kanazawa Angjoo. 2022. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
18. Fu Tsu-Jui, Yu Licheng, Zhang Ning, Fu Cheng-Yang, Su Jong-Chyi, Wang William Yang, and Bell Sean. 2023. Tell me what happened: Unifying text-guided video completion via multimodal masked video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'23).
19. Girshick Ross, Donahue Jeff, Darrell Trevor, and Malik Jitendra. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'13).
20. Gregory Jason. 2018. Game Engine Architecture. CRC Press.
21. Han Ligong, Ren Jian, Lee Hsin-Ying, Barbieri Francesco, Olszewski Kyle, Minaee Shervin, Metaxas Dimitris, and Tulyakov Sergey. 2022. Show me what and tell me how: Video synthesis via multimodal conditioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
22. Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, and Hochreiter Sepp. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS'17).
23. Ho Jonathan, Chan William, Saharia Chitwan, Whang Jay, Gao Ruiqi, Gritsenko Alexey, Kingma Diederik P., Poole Ben, Norouzi Mohammad, Fleet David J., and Salimans Tim. 2022a. Imagen Video: High definition video generation with diffusion models. arXiv (2022).
24. Ho Jonathan, Jain Ajay, and Abbeel Pieter. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS'20).
25. Ho Jonathan, Salimans Tim, Gritsenko Alexey A., Chan William, Norouzi Mohammad, and Fleet David J.. 2022b. Video diffusion models. In Proceedings of the ICLR Workshop on Deep Generative Models for Highly Structured Data.
26. Holden Daniel, Kanoun Oussama, Perepichka Maksym, and Popa Tiberiu. 2020. Learned motion matching. ACM Trans. Graph. 39, 4 (2020).
27. Holden Daniel, Komura Taku, and Saito Jun. 2017. Phase-functioned neural networks for character control. ACM Trans. Graph. 36, 4 (2017).
28. Hong Wenyi, Ding Ming, Zheng Wendi, Liu Xinghan, and Tang Jie. 2022. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv (2022).
29. Huang Jiahui, Jin Yuhe, Yi Kwang Moo, and Sigal Leonid. 2022. Layered controllable video generation. In Proceedings of the European Conference of Computer Vision (ECCV'22).
30. Jain Ajay, Mildenhall Ben, Barron Jonathan T., Abbeel Pieter, and Poole Ben. 2022. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
31. Johnson Justin, Alahi Alexandre, and Fei-Fei Li. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference of Computer Vision (ECCV'16).
32. Karras Tero, Laine Samuli, Aittala Miika, Hellsten Janne, Lehtinen Jaakko, and Aila Timo. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'20).
33. Kim Seung Wook, Philion Jonah, Torralba Antonio, and Fidler Sanja. 2021. DriveGAN: Towards a controllable high-quality neural simulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21).
34. Kim Seung Wook, Zhou Yuhao, Philion Jonah, Torralba Antonio, and Fidler Sanja. 2020. Learning to simulate dynamic environments with GameGAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'20).
35. Kingma Diederik P. and Ba Jimmy. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR'15). Retrieved from http://arxiv.org/abs/1412.6980
36. Kong Zhifeng, Ping Wei, Huang Jiaji, Zhao Kexin, and Catanzaro Bryan. 2020. DiffWave: A versatile diffusion model for audio synthesis. In Proceedings of the International Conference on Learning Representations (ICLR'20).
37. Kundu Abhijit, Genova Kyle, Yin Xiaoqi, Fathi Alireza, Pantofaru Caroline, Guibas Leonidas J., Tagliasacchi Andrea, Dellaert Frank, and Funkhouser Thomas A.. 2022. Panoptic neural fields: A semantic object-aware neural scene representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
38. Kwon Yong-Hoon and Park Min-Gyu. 2019. Predicting future frames using retrospective cycle GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19).
39. Lam Max W. Y., Wang Jun, Su Dan, and Yu Dong. 2022. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. In Proceedings of the International Conference on Learning Representations (ICLR'22).
40. Lee Kyungho, Lee Seyoung, and Lee Jehee. 2018. Interactive character animation by learning multi-objective control. ACM Trans. Graph. 37, 6 (2018).
41. Leng Yichong, Chen Zehua, Guo Junliang, Liu Haohe, Chen Jiawei, Tan Xu, Mandic Danilo, He Lei, Li Xiangyang, Qin Tao, Zhao Sheng, and Liu Tie-Yan. 2022. BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS'22).
42. Lewis J. P., Cordner Matt, and Fong Nickson. 2000. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the SIGGRAPH Conference.
43. Li Ruilong, Tanke Julian, Vo Minh, Zollhofer Michael, Gall Jurgen, Kanazawa Angjoo, and Lassner Christoph. 2022. TAVA: Template-free animatable volumetric actors. In Proceedings of the European Conference of Computer Vision (ECCV'22).
44. Lin Chen-Hsuan, Gao Jun, Tang Luming, Takikawa Towaki, Zeng Xiaohui, Huang Xun, Kreis Karsten, Fidler Sanja, Liu Ming-Yu, and Lin Tsung-Yi. 2023. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'23).
45. Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2020. RoBERTa: A robustly optimized BERT pretraining approach. In Proceedings of the International Conference on Learning Representations (ICLR'20).
46. Loper Matthew, Mahmood Naureen, Romero Javier, Pons-Moll Gerard, and Black Michael J.. 2015. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 34, 6 (2015).
47. Martel Julien N. P., Lindell David B., Lin Connor Z., Chan Eric R., Monteiro Marco, and Wetzstein Gordon. 2021. Acorn: Adaptive coordinate networks for neural scene representation. ACM Trans. Graph. 40, 4 (2021).
48. Menapace Willi, Lathuiliere Stephane, Tulyakov Sergey, Siarohin Aliaksandr, and Ricci Elisa. 2021. Playable video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21).
49. Menapace Willi, Lathuilière Stéphane, Siarohin Aliaksandr, Theobalt Christian, Tulyakov Sergey, Golyanik Vladislav, and Ricci Elisa. 2022. Playable environments: Video manipulation in space and time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
50. Meng Chenlin, Gao Ruiqi, Kingma Diederik P., Ermon Stefano, Ho Jonathan, and Salimans Tim. 2022. On distillation of guided diffusion models. In Proceedings of the NeurIPS 2022 Workshop on Score-based Methods.
51. Miech Antoine, Zhukov Dimitri, Alayrac Jean-Baptiste, Tapaswi Makarand, Laptev Ivan, and Sivic Josef. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'19).
52. Mildenhall Ben, Srinivasan Pratul P., Tancik Matthew, Barron Jonathan T., Ramamoorthi Ravi, and Ng Ren. 2020. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference of Computer Vision (ECCV'20).
53. Müller Norman, Simonelli Andrea, Porzi Lorenzo, Bulò Samuel Rota, Nießner Matthias, and Kontschieder Peter. 2022. AutoRF: Learning 3D object radiance fields from single view observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
54. Müller Thomas, Evans Alex, Schied Christoph, and Keller Alexander. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (2022). Retrieved from https://arxiv.org/abs/2201.05989
55. Müller Thomas, Rousselle Fabrice, Keller Alexander, and Novák Jan. 2020. Neural control variates. ACM Trans. Graph. 39, 6 (2020).
56. Niemeyer Michael and Geiger Andreas. 2021. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21).
57. Oh Junhyuk, Guo Xiaoxiao, Lee Honglak, Lewis Richard L., and Singh Satinder. 2015. Action-conditional video prediction using deep networks in Atari games. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS'15).
58. Ost Julian, Mannan Fahim, Thuerey Nils, Knodt Julian, and Heide Felix. 2021. Neural scene graphs for dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21).
59. Park Keunhong, Sinha Utkarsh, Barron Jonathan T., Bouaziz Sofien, Goldman Dan B., Seitz Steven M., and Martin-Brualla Ricardo. 2021a. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'21).
60. Park Keunhong, Sinha Utkarsh, Hedman Peter, Barron Jonathan T., Bouaziz Sofien, Goldman Dan B., Martin-Brualla Ricardo, and Seitz Steven M.. 2021b. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 40, 6 (2021).
61. Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J.. 2022. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1 (2022).
62. Ramesh Aditya, Dhariwal Prafulla, Nichol Alex, Chu Casey, and Chen Mark. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv (2022).
63. Ramesh Aditya, Pavlov Mikhail, Goh Gabriel, Gray Scott, Voss Chelsea, Radford Alec, Chen Mark, and Sutskever Ilya. 2021. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning (ICML'21).
64. Ren Shaoqing, He Kaiming, Girshick Ross, and Sun Jian. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS'15).
65. ReplayMod. 2022. ReplayMod. Retrieved from https://github.com/ReplayMod/ReplayMod
66. Rombach Robin, Blattmann Andreas, Lorenz Dominik, Esser Patrick, and Ommer Björn. 2021. High-resolution image synthesis with latent diffusion models. arXiv (2021).
67. Ronneberger Olaf, Fischer Philipp, and Brox Thomas. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI'15).
68. Saharia Chitwan, Chan William, Saxena Saurabh, Li Lala, Whang Jay, Denton Emily, Ghasemipour Seyed Kamyar Seyed, Gontijo-Lopes Raphael, Ayan Burcu Karagol, Salimans Tim, Ho Jonathan, Fleet David J., and Norouzi Mohammad. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NeurIPS'22). Retrieved from https://openreview.net/forum?id=08Yk-n5l2Al
69. Salimans Tim and Ho Jonathan. 2022. Progressive distillation for fast sampling of diffusion models. In Proceedings of the International Conference on Learning Representations (ICLR'22).
70. Schuhmann Christoph, Beaumont Romain, Vencu Richard, Gordon Cade W., Wightman Ross, Cherti Mehdi, Coombes Theo, Katta Aarush, Mullis Clayton, Wortsman Mitchell, Schramowski Patrick, Kundurthy Srivatsa R., Crowson Katherine, Schmidt Ludwig, Kaczmarczyk Robert, and Jitsev Jenia. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS'22) Datasets and Benchmarks Track.
71. Singer Uriel, Polyak Adam, Hayes Thomas, Yin Xi, An Jie, Zhang Songyang, Hu Qiyuan, Yang Harry, Ashual Oron, Gafni Oran, Parikh Devi, Gupta Sonal, and Taigman Yaniv. 2023. Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=nJfylDvgzlq
72. Song Jiaming, Meng Chenlin, and Ermon Stefano. 2021. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR'21).
73. Stanton Matt, Geddert Sascha, Blumer Adrian, Hormis Paul, Nealen Andy, Cooper Seth, and Treuille Adrien. 2016. Large-scale finite state game engines. In Proceedings of the Eurographics/ACM SIGGRAPH Symposium on Computer Animation.
74. Starke Sebastian, Zhang He, Komura Taku, and Saito Jun. 2019. Neural state machine for character-scene interactions. ACM Trans. Graph. 38, 6 (2019).
75. Starke Sebastian, Zhao Yiwei, Komura Taku, and Zaman Kazi. 2020. Local motion phases for learning multi-contact character movements. ACM Trans. Graph. 39, 4 (2020).
76. Tashiro Yusuke, Song Jiaming, Song Yang, and Ermon Stefano. 2021. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS'21).
77. Tevet Guy, Gordon Brian, Hertz Amir, Bermano Amit H., and Cohen-Or Daniel. 2022. MotionCLIP: Exposing human motion generation to CLIP space. In Proceedings of the European Conference of Computer Vision (ECCV'22).
78. Tretschk Edgar, Tewari Ayush, Golyanik Vladislav, Zollhöfer Michael, Lassner Christoph, and Theobalt Christian. 2021. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'21).
79. Unterthiner Thomas, van Steenkiste Sjoerd, Kurach Karol, Marinier Raphaël, Michalski Marcin, and Gelly Sylvain. 2018. Towards accurate generative models of video: A new metric & challenges. CoRR abs/1812.01717 (2018). Retrieved from http://arxiv.org/abs/1812.01717
80. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS'17).
81. Weng Chung-Yi, Curless Brian, Srinivasan Pratul P., Barron Jonathan T., and Kemelmacher-Shlizerman Ira. 2022. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'22).
82. Xu Yinghao, Chai Menglei, Shi Zifan, Peng Sida, Skorokhodov Ivan, Siarohin Aliaksandr, Yang Ceyuan, Shen Yujun, Lee Hsin-Ying, Zhou Bolei, and Tulyakov Sergey. 2023. DisCoScene: Spatially disentangled generative radiance fields for controllable 3D-aware scene synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'23). 4402–4412.
83. Yu Alex, Li Ruilong, Tancik Matthew, Li Hao, Ng Ren, and Kanazawa Angjoo. 2021. PlenOctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'21).
84. Zhang Mingyuan, Cai Zhongang, Pan Liang, Hong Fangzhou, Guo Xinying, Yang Lei, and Liu Ziwei. 2022. MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv (2022).
85. Zhang Richard, Isola Phillip, Efros Alexei A., Shechtman Eli, and Wang Oliver. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18).
