paper-reading1

Notes on papers I've been reading

Week 4

This week I mainly surveyed work on long video generation.

PHENAKI: variable length video generation from open domain textual descriptions

Background and challenges

  1. Lack of data and compute
  2. A single short prompt is not enough to fully describe a video:
    Ideally, a video generation model must be able to generate videos of arbitrary length, all the while having the capability of conditioning the generated frames at time t on prompts at time t that can vary over time. Such capability can clearly distinguish the video from a “moving image”.

Architecture

C-ViViT

ViViT was originally a video classification model that adds a temporal transformer on top of the standard spatial Transformer; here it is repurposed as the tokenizer, encoding video into discrete tokens and decoding them back.
C-ViViT makes a few changes:

  1. remove the [CLS] tokens in the spatial and the temporal transformers
  2. apply temporal transformer for all spatial tokens computed by the spatial encoder, in contrast to single run of the temporal transformer over the [CLS] tokens in ViViT
  3. Most importantly, the ViViT encoder requires a fixed length video input due to the all-to-all attention in time. Therefore, we apply causal attention instead such that our C-ViViT encoder becomes autoregressive and allows for a variable number of input frames which are necessary to learn from image datasets, and auto-regressively extrapolate video or single frames into the future.
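To make the third change concrete, here is a minimal sketch (my own, not the paper's code) of a factorized encoder in the C-ViViT spirit: a spatial transformer runs within each frame, then a temporal transformer with a causal mask runs across frames for every spatial position, so the encoder accepts any number of frames. All layer sizes and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CViViTSketch(nn.Module):
    """Spatial attention per frame, then causal temporal attention per patch position."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, tokens):                               # tokens: (B, T, N, D)
        B, T, N, D = tokens.shape
        x = self.spatial(tokens.reshape(B * T, N, D))        # all-to-all within a frame
        x = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.temporal(x, mask=causal)                    # frame t sees only frames <= t
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)     # back to (B, T, N, D)

video_tokens = torch.randn(1, 5, 64, 256)    # 5 frames, 64 patch tokens each
latent = CViViTSketch()(video_tokens)        # any number of frames works
```

The causal mask is what lets the same encoder be trained on single images (T = 1) and then extrapolate video frame by frame, as the quoted item 3 describes.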

Glossary


CausalSelfAttention
A self-attention mechanism used mainly in sequence generation tasks such as machine translation and text summarization. Its main characteristics:

  • Attention can only look at the current and earlier positions, never at later ones, which matches the causal nature of sequence generation. In machine translation, for example, the model must not peek at the words after the one it is currently producing.
  • Causality is enforced by masking future positions: the attention scores above the diagonal are set to -inf before the softmax (so the corresponding attention weights become zero), which prevents the model from seeing future information.
  • It is used in Transformers and their variants to capture temporal dependencies in a sequence. The Transformer itself computes all positions in parallel; causal self-attention is what gives it a left-to-right, sequential structure.
  • Compared with unmasked attention, the representations learned with causal self-attention respect temporal order, which suits sequence generation tasks.
  • In an encoder–decoder Transformer it is the decoder that uses this masked (causal) self-attention; the encoder uses full bidirectional self-attention, and the decoder additionally uses cross-attention, which may look at every encoder output so the whole input context is available.
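As a concrete illustration of the masking bullet above, a minimal causal attention function might look like the sketch below (an illustration of the general mechanism, not any specific codebase): scores above the diagonal are set to -inf before the softmax, so their attention weights come out as zero.

```python
import torch

def causal_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5    # (T, T) attention logits
    T = scores.shape[-1]
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))         # hide future positions
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(6, 32)          # 6 time steps, 32-dim features
out = causal_attention(q, k, v)         # row t mixes only steps 0..t
```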

Autoregressive models
A class of machine learning (ML) models that predict the next element of a sequence from the elements that came before it.
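A toy numeric illustration of the definition: an AR(2)-style model whose made-up coefficients predict the next value from the previous two.

```python
def ar2_predict(history, a1=0.6, a2=0.3):
    # next value as a fixed linear function of the two most recent values
    return a1 * history[-1] + a2 * history[-2]

seq = [1.0, 1.2]
for _ in range(4):              # roll the model forward on its own outputs
    seq.append(ar2_predict(seq))
print(seq)                      # each new element depends only on earlier ones
```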

MaskGIT


Transformer-based image generation largely copies how NLP handles sequence data and needs two steps:

  1. Tokenization: natural language is made of discrete symbols while images are continuous values, so the image must first be discretized to be handled the NLP way. iGPT simply turns the image into a grid of coarse color blocks; ViT cuts it into patches and linearly projects each one; other methods train a dedicated autoencoder, using the encoder to map the image to tokens and the decoder to reconstruct it.
  2. Autoregressive Prediction: a unidirectional Transformer predicts the tokens one at a time until the whole image is generated.

The core idea of MaskGIT follows the way a person paints: generate part of the tokens first, then gradually refine the rest.
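A rough sketch of that iterative decoding loop, under my own simplifying assumptions (the real MaskGIT samples tokens and uses a cosine masking schedule; `dummy_model` here is just a stand-in for the bidirectional transformer):

```python
import torch

def maskgit_decode(model, length, steps=8, mask_id=0):
    tokens = torch.full((length,), mask_id)                  # start with every token masked
    known = torch.zeros(length, dtype=torch.bool)
    for s in range(1, steps + 1):
        logits = model(tokens)                               # (length, vocab) token logits
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        pred[known] = tokens[known]                          # keep already committed tokens
        conf[known] = float("inf")
        n_keep = max(1, int(length * s / steps))             # commit more tokens each step
        keep = conf.topk(n_keep).indices
        tokens[keep] = pred[keep]
        known[keep] = True
        tokens[~known] = mask_id                             # the rest stays masked
    return tokens

dummy_model = lambda t: torch.randn(t.shape[0], 1024)        # stand-in for the transformer
print(maskgit_decode(dummy_model, length=16))
```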

In Phenaki this appears to be what predicts the tokens of the next time step (see the figure in the paper).

Generation process

For the first frame, empty (masked) tokens plus the text prompt are passed through the transformer once to produce the initial video.
To predict the next frame, its tokens are first marked as empty, then passed through the transformer again together with the prompt for that time step.
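Putting the two steps together, the per-frame loop might look roughly like this (a schematic with placeholder names `masked_transformer` / `text_encode`, not Phenaki's actual code):

```python
import torch

def generate_video(masked_transformer, text_encode, prompts, tokens_per_frame, mask_id=0):
    video = torch.zeros(0, dtype=torch.long)                     # no frames generated yet
    for prompt in prompts:                                       # the prompt may change per step
        new_frame = torch.full((tokens_per_frame,), mask_id)     # empty tokens for frame t
        context = torch.cat([video, new_frame])
        filled = masked_transformer(context, text_encode(prompt))  # assumed to return tokens
        video = torch.cat([video, filled[-tokens_per_frame:]])   # keep only the new frame
    return video
```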

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Background and challenges

Compute limits the generation of long videos. Earlier models use the “Autoregressive over X” architecture, where “X” denotes any generative models capable of generating short video clips; the appeal is that they can be trained on short clips yet still generate long videos.

But there are two problems:

  1. Hard to maintain consistency: Firstly, training on short videos but forcing it to infer long videos leads to an enormous training-inference gap. It can result in unrealistic shot change and long-term incoherence in generated long videos, since the model has no opportunity to learn such patterns from long videos.
  2. No parallelism: Secondly, due to the dependency limitation of the sliding window, the inference process can not be done in parallel and thus takes a much longer time.

Architecture

It is essentially a recursive bisection: a global diffusion model (GlobalDiffusion) first generates the keyframes, then a local diffusion model (LocalDiffusion) keeps filling in the frames in between (see the sketch below).
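A toy sketch of that coarse-to-fine recursion (the `global_diffusion` / `local_diffusion` callables are placeholders returning lists of frames; the real NUWA-XL models are diffusion networks conditioned on the neighbouring frames):

```python
def generate_long_video(prompt, global_diffusion, local_diffusion, depth=2):
    keyframes = global_diffusion(prompt)                 # a handful of sparse keyframes
    segments = []
    for left, right in zip(keyframes[:-1], keyframes[1:]):
        # each call depends only on its two anchors, so this loop can run in parallel
        segments.append(fill_between(prompt, left, right, local_diffusion, depth))
    return [keyframes[0]] + [f for seg in segments for f in seg]

def fill_between(prompt, left, right, local_diffusion, depth):
    middle = local_diffusion(prompt, left, right)        # frames between the two anchors
    if depth == 1:
        return middle + [right]
    frames = []
    anchors = [left] + middle + [right]
    for a, b in zip(anchors[:-1], anchors[1:]):
        frames += fill_between(prompt, a, b, local_diffusion, depth - 1)
    return frames
```

Because each `fill_between` call only needs its two anchor frames, calls at the same level are independent, which is exactly the parallelism advantage over the sliding-window autoregressive approach.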

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

The review breaks Sora down into three parts:

  1. A time-space compressor first maps the original video into latent space.
  2. A ViT then processes the tokenized latent representation and outputs the denoised latent representation.
  3. A CLIP-like [26] conditioning mechanism receives LLM-augmented user instructions and potentially visual prompts to guide the diffusion model to generate styled or themed videos.

Part 1: Data preprocessing

Sora employs spacetime latent patches as its building blocks. Specifically, Sora compresses a raw input video into a latent spacetime representation. Then, a sequence of latent spacetime patches is extracted from the compressed video to encapsulate both the visual appearance and motion dynamics over brief intervals. These patches, analogous to word tokens in language models, provide Sora with detailed visual phrases to be used to construct videos.

This compression and patching reduces the dimensionality of the data and hence the compute, and at the same time lets the model handle videos of different sizes and durations while preserving the original information (previous pipelines simply cropped videos to a fixed size).
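An illustrative guess at what extracting spacetime patches could look like (Sora's implementation is not public; the patch sizes here are arbitrary assumptions): the compressed latent is cut into small (time, height, width) tubes and each tube is flattened into one token.

```python
import torch

def spacetime_patchify(latent, pt=2, ph=4, pw=4):
    # latent: (C, T, H, W) in the compressed latent space
    C, T, H, W = latent.shape
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)                     # group by patch position
    return x.reshape(-1, C * pt * ph * pw)                 # (num_patches, token_dim)

latent = torch.randn(8, 16, 32, 32)                        # toy compressed video latent
tokens = spacetime_patchify(latent)                        # (16/2)*(32/4)*(32/4) = 512 tokens
```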

Related: ViT, ViViT

The compressed latents come in different sizes, so the patch n' pack (PNP) technique is used to patchify the latent into transformer tokens (a simplified sketch is below).
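A much-simplified sketch of the packing idea (mask convention and shapes are my assumptions): token sequences from differently sized videos are concatenated into one long sequence, and a block mask keeps attention from crossing example boundaries.

```python
import torch

def pack_sequences(seqs):
    packed = torch.cat(seqs, dim=0)                        # (total_tokens, D)
    ids = torch.cat([torch.full((len(s),), i) for i, s in enumerate(seqs)])
    attn_mask = ids[:, None] != ids[None, :]               # True = attention blocked
    return packed, attn_mask

a = torch.randn(12, 256)                                   # tokens from a small clip
b = torch.randn(40, 256)                                   # tokens from a larger clip
packed, mask = pack_sequences([a, b])                      # one batch, no padding waste
```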

Part 2: Model architecture

For image models, replacing the original U-Net with a Transformer backbone makes it possible to train on more data with more parameters; see DiT and U-ViT.

Applying DiTs to video raises three questions:

  1. how to compress the video spatially and temporally to a latent space for efficient denoising;
  2. how to convert the compressed latent to patches and feed them to the transformer;
  3. how to handle long-range temporal and spatial dependencies and ensure content consistency.

Two similar works are Imagen Video and Video LDM.

Imagen Video, a text-to-video generation system developed by Google Research, utilizes a cascade of diffusion models, which consists of 7 sub-models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution, to transform textual prompts into high-definition videos.

The review speculates that Sora also leverages a cascade diffusion model architecture composed of a base model and many space-time refiner models.
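Schematically, such a cascade would just be a base model followed by a chain of refiners, something like the following (placeholder callables; whether Sora actually works this way is the review's speculation, not fact):

```python
def cascade_generate(prompt, base_model, refiners):
    video = base_model(prompt)              # low-resolution, low-frame-rate draft
    for refine in refiners:                 # e.g. spatial SR, then temporal SR stages
        video = refine(video, prompt)       # each stage upsamples space or time
    return video
```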

Part 3: Language instruction following

To enhance the ability of instruction following, Sora adopts a similar caption improvement approach. This method is achieved by first training a video captioner capable of producing detailed descriptions for videos. Then, this video captioner is applied to all videos in the training data to generate high-quality (video, descriptive caption) pairs, which are used to fine-tune Sora to improve its instruction following ability
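The re-captioning pipeline itself is straightforward; a sketch with placeholder names:

```python
def build_training_pairs(videos, captioner):
    # produce (video, descriptive caption) pairs using the trained video captioner
    return [(video, captioner(video)) for video in videos]

# hypothetical usage:
# pairs = build_training_pairs(training_videos, video_captioner)
# finetune(text_to_video_model, pairs)   # improves instruction following
```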