
Understanding STARFlow: Apple's Autoregressive Flow Model for Generative AI

Stanley Ulili
Updated on December 7, 2025

The ability to generate high-quality images and videos from simple text prompts has become a defining frontier in artificial intelligence. While models like OpenAI's DALL-E and Midjourney have captured public attention, Apple has entered the arena with an unexpected contribution: STARFlow, an open-source AI model for image and video generation.

STARFlow (Scalable Transformer Autoregressive Flow) introduces a novel architecture that creates stunning visuals with significantly lower computational demands than its competitors. Apple has made it free and open-source, allowing developers and researchers worldwide to explore its capabilities.

This article explores STARFlow's unique architecture, examining how it combines autoregressive models with normalizing flows to achieve faster generation times and precise editing capabilities. You'll learn what sets STARFlow apart from diffusion models, how its hybrid approach works, and what this means for the future of generative AI.

What is STARFlow? Apple's bold move in generative AI

For years, Apple maintained a tightly controlled, closed ecosystem. The company's decision to release a state-of-the-art AI model as open-source represents a significant strategic shift, signaling a new era for Apple's involvement in the AI community.

Apple's growing open-source presence

Apple has steadily increased its open-source contributions. Their official GitHub page reveals over 375 public repositories, with an impressive 200 focused on machine learning. This strategic move suggests Apple is actively working to shape the AI revolution by fostering innovation around its research. STARFlow represents the latest and perhaps most exciting example of this philosophy.

A screenshot of Apple's GitHub page showing a search for "ml-" yielding 200 results, indicating their extensive work in machine learning.

The STARFlow models: image and video generation

The STARFlow repository contains the architecture for two distinct but related models, each designed for a specific creative task:

STARFlow (3B Parameters, Text-to-Image): The foundational model synthesizes high-quality images from text descriptions. It operates on a 3-billion parameter architecture, making it relatively lightweight compared to other massive models in the field.

STARFlow-V (7B Parameters, Text-to-Video): The more powerful sibling boasts a 7-billion parameter architecture. It extends STARFlow's core principles to the temporal domain, generating short video clips (around 5 seconds at 16 FPS) directly from text prompts.

The Model Architecture section from the STARFlow GitHub README, clearly distinguishing between the Text-to-Image and Text-to-Video models and their respective parameters.

A new architecture for a new era

STARFlow's unique hybrid architecture combines the expressive power of autoregressive models with the efficiency of normalizing flows. This fusion allows STARFlow to achieve state-of-the-art results while being significantly faster and more computationally efficient during inference. This efficiency could eventually unlock the possibility of running powerful generative AI directly on consumer devices like MacBooks and iPhones.

Unpacking the technology: how STARFlow works its magic

Understanding STARFlow requires breaking down the core concepts behind its design into manageable components.

The autoregressive approach: building content sequentially

An autoregressive model generates data one piece at a time, where each new piece is predicted from all previous pieces. Consider the autocomplete on your phone, or a large language model like ChatGPT: when it suggests the next word in your sentence, it isn't guessing randomly but analyzing the sequence of words you've already typed to predict the most probable next word. This is an autoregressive process.

STARFlow applies this principle to images. Instead of generating an entire image at once, an autoregressive image model can generate it pixel by pixel or patch by patch. The color and properties of each new pixel are determined by the pixels that have already been generated.

An animation demonstrating the concept of pixel-by-pixel generation, where an image of a Santa hat is constructed one square at a time in a grid.
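To make the idea concrete, here is a toy sketch of autoregressive generation in Python. It is purely illustrative, not STARFlow's actual code: the predictor here is a simple placeholder standing in for a trained neural network.

# Toy autoregressive generation (illustrative only, not STARFlow's code):
# each new pixel is predicted from all pixels generated before it.
import numpy as np

rng = np.random.default_rng(seed=0)

def predict_next_pixel(previous_pixels):
    # Stand-in for a trained network that maps the generated-so-far context
    # to a distribution over the next pixel's value.
    mean = previous_pixels.mean() if previous_pixels.size else 0.5
    return float(np.clip(rng.normal(loc=mean, scale=0.1), 0.0, 1.0))

height, width = 8, 8
pixels = np.zeros(height * width)
for i in range(height * width):
    pixels[i] = predict_next_pixel(pixels[:i])  # condition only on earlier pixels
image = pixels.reshape(height, width)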

A different path: STARFlow vs. diffusion models

Most popular image generation models today, like Stable Diffusion, are diffusion models. Their process differs significantly. They start with a canvas of pure random static, or Gaussian noise. Through dozens or even hundreds of steps, a neural network meticulously "denoises" this static, gradually refining it until a coherent image matching the text prompt emerges. While incredibly powerful, this iterative process can be slow and computationally intensive.

A visual representation of the diffusion process, showing a grid of images progressing from complete noise on the left to a clear image on the right through sequential "denoise" steps.
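For contrast, here is a schematic of a generic diffusion sampler's inner loop. This is a conceptual sketch with a placeholder denoiser, not Stable Diffusion's actual API; the point is the many sequential network calls.

# Schematic diffusion sampling loop (conceptual; the denoiser is a placeholder).
import torch

def predict_noise(x, t):
    # Stand-in for a trained noise-prediction network conditioned on step t.
    return 0.02 * torch.randn_like(x)

num_steps = 50                      # diffusion needs many sequential passes
x = torch.randn(1, 3, 256, 256)     # start from pure Gaussian noise
for t in reversed(range(num_steps)):
    x = x - predict_noise(x, t)     # remove a little predicted noise each step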

The secret sauce: normalizing flows

STARFlow's first major innovation is its use of a concept called a normalizing flow. A normalizing flow model employs mathematical techniques that transform a simple, well-understood probability distribution (like random Gaussian noise) into a much more complex and structured one (like the distribution of pixels forming a realistic image).

The transformation uses a series of invertible mathematical functions. You can transform simple noise into a complex image, but you can also take that final image and perfectly reverse the process to get back to the exact initial noise it came from. This invertibility provides powerful capabilities for image editing.
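A tiny numerical example shows what invertibility means in practice. The elementwise transform below is only a stand-in for real flow layers, but the round trip back to the exact starting noise is the key property.

# Minimal invertible transform (toy stand-in for a flow layer).
import torch

scale, shift = 2.0, 0.5

def flow_forward(z):
    return z * scale + shift        # simple noise -> structured "data"

def flow_inverse(x):
    return (x - shift) / scale      # data -> the exact noise it came from

z = torch.randn(4)
x = flow_forward(z)
print(torch.allclose(flow_inverse(x), z))   # True: the process reverses exactly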

The hybrid genius: fusing the best of both worlds

STARFlow's design is both unusual and brilliant: it's an autoregressive model that also uses Gaussian noise as its starting point, much like a diffusion model.

STARFlow transforms a simple Gaussian noise pattern into a final, coherent image. However, instead of relying on a multi-step denoising process like diffusion, it uses invertible flow layers to accomplish the entire transformation in a single forward pass, which makes generation fundamentally faster. The "autoregressive" aspect of the model applies not to the pixels themselves but to the operation of these powerful flow layers.

Taming computational cost with Parallel Jacobi Iterations

A major challenge with autoregressive models is their computational cost, which can explode as the model processes more image patches. The work needed to predict each new patch grows quadratically with the number of patches, making high-resolution image generation very slow.

STARFlow overcomes this with Parallel Jacobi Iterations, a method for solving systems of equations by running calculations for different parts of the problem simultaneously. By applying this to its flow layers, STARFlow can process multiple image patches at once, dramatically reducing generation time. According to Apple's research paper, this optimization makes the inference process up to 15 times faster, a massive leap forward, especially for the heavy demands of video generation.

A simplified animation illustrating Parallel Jacobi Iterations, where multiple rows of a grid are processed simultaneously before converging on a final shape.
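The numerical idea behind Jacobi iteration is simple: instead of solving for each unknown strictly in sequence, you update all of them in parallel from the previous guess and repeat until the values stop changing. Here is a toy example on a small linear system, purely illustrative and unrelated to Apple's implementation.

# Toy Jacobi iteration on a 2x2 linear system A @ x = b (illustrative only).
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 5.0]])
b = np.array([1.0, 2.0])

x = np.zeros_like(b)                 # initial guess for every unknown at once
D = np.diag(A)                       # diagonal of A
R = A - np.diagflat(D)               # off-diagonal remainder
for _ in range(25):
    x = (b - R @ x) / D              # all components updated simultaneously
print(x, A @ x)                      # converges so that A @ x is close to b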

The power of invertibility: unlocking precise editing

The invertibility of normalizing flows is a game-changer for image and video editing. Because you can map perfectly between the final image and its underlying noise representation, you can make highly targeted edits.

For example, you could take a generated video, change a small part of it (like turning an orange into a lemon), and then regenerate the video. The model would only alter the pixels related to the fruit, leaving the rest of the scene (the hands, the background, the lighting) perfectly intact and consistent. This level of precise, localized control is extremely difficult to achieve with traditional diffusion models.
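Conceptually, edit-by-inversion looks like the sketch below: invert the image back to its latent noise, change only the latent values behind the region you care about, and run the flow forward again. With this elementwise toy flow the edit stays perfectly localized; the real model's flow layers mix information, so its editing procedure is more involved, but the invertibility that makes it possible is the same.

# Conceptual edit-by-inversion (toy flow, not STARFlow's actual editing API).
import torch

def flow_forward(z):
    return z * 2.0 + 0.5             # stand-in for the invertible model

def flow_inverse(x):
    return (x - 0.5) / 2.0

image = flow_forward(torch.randn(1, 3, 64, 64))
latent = flow_inverse(image)                              # exact latent behind the image
latent[:, :, 20:40, 20:40] = torch.randn(1, 3, 20, 20)    # re-sample one region only
edited = flow_forward(latent)                             # everything outside is untouched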

Setting up and running STARFlow

STARFlow is a research project requiring a specific environment. You'll need a machine with a powerful NVIDIA GPU, a Linux environment, Python, Conda for environment management, and Git.

Setting up your environment

The repository includes a setup script for a Conda environment. Clone the repository from GitHub:

 
git clone https://github.com/apple/ml-starflow

Navigate into the directory:

 
cd ml-starflow

Set up the environment using the provided script:

 
bash scripts/setup_conda.sh

Alternatively, you can manage your environment manually by installing required packages:

 
pip install -r requirements.txt
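Once the environment is in place, a quick sanity check (not part of the repository's scripts) confirms that PyTorch was installed with CUDA support and can see your GPU:

# Quick environment check: verify PyTorch sees a CUDA-capable GPU.
import torch

print(torch.__version__)
print(torch.cuda.is_available())      # should print True on a working setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))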

Downloading the model checkpoint

The code repository doesn't include the actual trained model weights, known as "checkpoints." These large files need to be downloaded separately from the Hugging Face model hub at apple/starflow.

In the Files tab, you'll find the checkpoint file. The main one available is starflow_3B_t2i_256x256.pth. Create a directory named ckpts inside your ml-starflow folder and place the downloaded .pth file there. Your file path should be ml-starflow/ckpts/starflow_3B_t2i_256x256.pth.
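If you prefer to script the download, the huggingface_hub Python package (installable with pip) can fetch the checkpoint directly into the ckpts folder. The repository and file names below are the ones listed on the model page:

# Download the checkpoint into ./ckpts using huggingface_hub.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="apple/starflow",
    filename="starflow_3B_t2i_256x256.pth",
    local_dir="ckpts",
)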

Generating AI images with STARFlow

The STARFlow repository provides scripts that simplify image generation from the command line.

Example: a cat playing the piano

The provided sample script handles the complex command-line arguments for you. Run it with a text prompt:

 
bash scripts/test_sample_image.sh "a film still of a cat playing piano"

The script loads the model and generates images. By default, it creates a batch of 8 images and saves them as a single grid. The results show different cats in various poses near pianos. However, common AI artifacts may appear, such as strangely formed paws, warped piano keys, or odd lighting. This is typical for generative models, especially in early research stages.

Example: a polar bear in the desert

Another prompt demonstrates how the model handles surreal concepts:

 
bash scripts/test_sample_image.sh "a polar bear running in the desert"

The model produces a grid of 8 images showing a variety of styles, from photorealistic attempts to more cartoonish or illustrative renderings. This demonstrates the model's ability to interpret a prompt in different creative ways. While some images may look fantastic, others might have minor flaws like strange shadows or awkward poses.

Customizing your generations

For more control, you can use the torchrun command directly:

 
torchrun --standalone --nproc_per_node 1 sample.py \
--model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
--checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
--caption "your custom prompt here" \
--sample_batch_size 1 \
--cfg 3.0 \
--aspect_ratio "1:1" \
--seed 999

Key parameters you can adjust:

--caption: Your text prompt for image generation.

--sample_batch_size: Number of images to generate. Setting this to 1 produces a single image instead of a grid of 8.

--cfg: Classifier-Free Guidance scale. This number controls how strongly the model adheres to your prompt. Higher values generally mean more adherence but can sometimes reduce creativity.

--aspect_ratio: Sets the output aspect ratio, such as 16:9 or 9:16, for varied compositions.

--seed: This number initializes the random noise. Using the same seed with the same prompt produces the exact same image every time, useful for reproducibility.

The future of STARFlow: video generation and beyond

While the text-to-image model is available for experimentation, the most exciting frontier for STARFlow is video.

The promise of STARFlow-V

The STARFlow-V model for text-to-video generation is the true powerhouse of this research. Apple has not yet released a public checkpoint for this model, so it remains inaccessible for direct experimentation.

The examples provided on their project website showcase capabilities that go beyond simple text-to-video generation:

Video-to-video editing: The ability to make precise edits, like changing the color of a vase in a moving video while keeping everything else consistent.

Video outpainting: Extending the frame of a video, seamlessly generating new content at the edges to change its aspect ratio or follow a subject out of the original frame.

These advanced editing features, enabled by the invertible normalizing flow architecture, could revolutionize video post-production workflows.

Can it run on your iPhone?

The current STARFlow implementation requires a specific Python and PyTorch environment and relies on powerful NVIDIA GPUs with CUDA, which Apple's consumer hardware does not offer. The model is not yet optimized to run on Apple's M-series chips or their Metal graphics API.
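If you are curious whether your own PyTorch build can at least see Apple's Metal backend (exposed as MPS in PyTorch), there is a one-line check, though STARFlow's scripts currently assume CUDA:

# Check for Apple's Metal (MPS) backend in PyTorch; STARFlow's scripts assume CUDA.
import torch

print(torch.backends.mps.is_available())   # True on recent Apple Silicon Macs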

While STARFlow's efficiency makes this a future possibility, significant optimization and integration work would be required. A future, highly optimized version could potentially perform image generation on a high-end MacBook Pro, but real-time video generation on an iPhone remains a distant prospect.

A research project with massive potential

STARFlow is currently a research release, a demonstration of a new technique rather than a polished consumer product. The publicly available model is limited to a low resolution (256x256), even though Apple's research paper showcases much higher-resolution examples. This indicates Apple has more powerful versions internally and is sharing the foundational technology with the community to spur further innovation.

Final thoughts

STARFlow is an important move for Apple and an interesting step forward for the whole AI world. Apple’s researchers have combined autoregressive models with the speed and flexibility of normalizing flows, creating a new way to build generative AI that is more efficient without losing quality. On top of that, it can generate results in a single pass, and its invertible design allows very precise edits. These are real breakthroughs that could influence the future of the field for a long time.

Apple has also released STARFlow as open-source. This decision not only adds to the shared knowledge of the AI community but also shows that Apple wants to be a leading force in this fast-changing area. It might take some time before STARFlow appears in Apple’s products, but its release is already a key moment. Step by step, AI-powered creativity is becoming faster, more efficient, and, thanks to Apple, a little more open to everyone.