Phenaki is a text-to-video model that can generate videos several minutes long directly from text. It can also generate a video from a single image and a prompt. Compared with existing per-frame baselines in the literature, the proposed video encoder-decoder produces fewer tokens per video while achieving better spatio-temporal quality. Video tokens are generated from text by a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are then de-tokenized to produce the final video.
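To make the token-generation step more concrete, here is a minimal sketch of iterative masked decoding, the general way a bidirectional masked transformer can fill in a token grid over a few passes. It is an illustration under stated assumptions, not Phenaki's implementation: `toy_predictor` is a hypothetical stand-in for the real transformer, and the cosine unmasking schedule, the vocabulary size, and the step count are all assumed for the example.

```python
import math
import random

MASK = -1  # sentinel for a video-token position that has not been generated yet


def toy_predictor(tokens, text_tokens, vocab_size, rng):
    """Hypothetical stand-in for the bidirectional transformer.

    Returns a (token_id, confidence) guess for every position. A real model
    would attend to all current tokens and to the text tokens; this toy
    version only mixes in the text tokens so the example stays runnable.
    """
    guesses = []
    for i in range(len(tokens)):
        text_bias = text_tokens[i % len(text_tokens)]
        token_id = (text_bias + rng.randrange(vocab_size)) % vocab_size
        guesses.append((token_id, rng.random()))
    return guesses


def masked_decode(num_tokens, text_tokens, vocab_size=16, steps=4, seed=0):
    """Fill a fully masked token sequence over a fixed number of passes."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_predictor(tokens, text_tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        ratio = math.cos(math.pi / 2 * step / steps)
        keep_masked = max(0, math.floor(num_tokens * ratio))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses; the rest are predicted again later.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    video_tokens = masked_decode(num_tokens=12, text_tokens=[3, 1, 4])
    print(video_tokens)  # these ids would then be de-tokenized into frames
```

Because every pass predicts all positions at once and commits only the most confident ones, the sequence is completed in a handful of passes rather than one token at a time. In a full pipeline, the resulting token ids would be handed to the video decoder for de-tokenization.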