
Runpod Flash: deploying serverless GPU functions from Python

Stanley Ulili
Updated on March 23, 2026

Runpod Flash is a Python SDK that turns decorated functions into serverless GPU endpoints. It removes the need to write a Dockerfile, push to a container registry, or configure a cloud deployment. A function decorated with @Endpoint is automatically packaged, deployed to Runpod's infrastructure, executed on the requested hardware, and torn down when the job completes.

The traditional GPU deployment workflow

Deploying a Python script that requires GPU acceleration in the cloud has historically involved several steps that have nothing to do with the application logic itself.

Infographic comparing the complex traditional deployment method with the streamlined Runpod Flash workflow

The typical path starts with writing a Dockerfile that specifies a base OS, installs system dependencies, sets up a Python environment, and installs packages. That image is then built locally, pushed to a container registry such as Docker Hub or Amazon ECR, and finally referenced in a cloud deployment configuration. For Kubernetes, this means writing YAML manifests, managing secrets, and defining scaling policies.

The entire process shifts developer attention from application logic to infrastructure management. It is slow, error-prone, and assumes familiarity with containerization and cloud operations that many ML and data science developers do not have.

How Flash works

Flash abstracts the infrastructure layer entirely.

Runpod blog excerpt reading "No Docker. No container orchestration. Just Python."

When a decorated function is called, Flash inspects the function and its declared dependencies, packages the code, sends it to Runpod's serverless platform, provisions a worker with the requested hardware, installs dependencies, runs the function, returns the result, and shuts down the worker. Billing covers only the execution time.

Automatic environment synchronization

Flash handles cross-platform differences automatically. Code written on macOS is correctly compiled and installed for the target Linux GPU environment at execution time, without any manual intervention.

Runpod dashboard logs showing the automatic installation of Python dependencies on the remote worker
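To see why an environment can't simply be copied from a laptop to a worker, compare the platform identifiers Python reports locally. This is a stdlib-only illustration, not part of the Flash SDK:

```python
import platform
import sysconfig

# A package with compiled extensions (torch, Pillow, opencv-python, ...)
# ships a different wheel per platform tag. These values differ between a
# macOS laptop and a Linux GPU worker, which is why Flash resolves and
# installs dependencies on the remote side instead of uploading local ones.
local_os = platform.system()               # e.g. "Darwin" locally, "Linux" on the worker
wheel_platform = sysconfig.get_platform()  # e.g. "macosx-14.0-arm64" vs "linux-x86_64"
print(local_os, wheel_platform)
```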

Independent hardware per function

Each decorated function is a distinct endpoint and can be assigned different hardware. A pre-processing function that does not need a GPU can run on a CPU worker at lower cost. A generation function that requires a high-end GPU can specify one explicitly. If multiple requests arrive simultaneously, Flash spins up parallel workers for each, so throughput scales with demand without a separate queueing system.

Each endpoint is also a live callable API, which means GPU functions can be triggered from a web backend, a mobile app, or any other HTTP client without additional setup.
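The fan-out behavior can be modeled locally with plain asyncio. This is a standard-library sketch using stand-in coroutines, not the Flash SDK itself; with Flash, each concurrent call would land on its own worker:

```python
import asyncio

async def worker_stub(job_id: int) -> dict:
    # Stand-in for a remote Flash endpoint call; the sleep simulates execution.
    await asyncio.sleep(0.01)
    return {"job": job_id, "status": "done"}

async def fan_out(n: int) -> list:
    # Issue n requests concurrently, mirroring how simultaneous calls to a
    # Flash endpoint are served by parallel workers rather than a queue.
    return await asyncio.gather(*(worker_stub(i) for i in range(n)))

results = asyncio.run(fan_out(5))
print(results)
```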

A multi-stage image-to-video pipeline

The following example builds a two-stage pipeline that takes a static image and a text prompt and returns an AI-generated video using the CogVideoX 5B model. Stage one handles image pre-processing on a CPU worker. Stage two handles video generation on a GPU worker. The output of stage one is passed directly as input to stage two.

Project setup

Create a project directory and initialize a virtual environment using uv:

mkdir runpod-flash-test
cd runpod-flash-test
uv init
uv add runpod-flash
source .venv/bin/activate

Authenticate with Runpod:

flash login

This prints a URL to open in the browser for authorization. After completing that step, the CLI is authenticated.

Stage 1: image preparation on a CPU worker

The first function downloads an image from a URL, resizes it to the 720x480 format expected by CogVideoX, and returns it as a base64-encoded string. This task needs no GPU, so the decorator specifies a CPU configuration.

run.py
import asyncio
import base64
import sys
import time
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="image-prepper",
    cpu="cpu5c-4-8",
    dependencies=["Pillow", "requests"]
)
async def prepare_adaptive_image(url: str):
    import base64
    import io
    import requests
    from PIL import Image

    response = requests.get(url)
    img = Image.open(io.BytesIO(response.content)).convert("RGB")
    img = img.resize((720, 480))

    buffered = io.BytesIO()
    img.save(buffered, format="JPEG")

    raw_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return {"raw_b64": raw_b64}

The @Endpoint decorator parameters control how Flash provisions the worker. name sets the label shown in the Runpod dashboard. cpu selects a specific CPU type rather than a GPU. dependencies lists the packages Flash installs on the remote worker before running the function.
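The resize-and-encode logic can be exercised locally before deploying, assuming Pillow is installed in the local environment (it is listed in the endpoint's dependencies for the remote worker):

```python
import base64
import io

from PIL import Image

# Build a synthetic image in memory instead of downloading one.
img = Image.new("RGB", (1024, 768), color=(200, 120, 40))
img = img.resize((720, 480))  # the 720x480 format CogVideoX expects

buffered = io.BytesIO()
img.save(buffered, format="JPEG")
raw_b64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

# Round-trip: decode and confirm the payload is a valid 720x480 JPEG.
decoded = Image.open(io.BytesIO(base64.b64decode(raw_b64)))
print(decoded.size, decoded.format)  # → (720, 480) JPEG
```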

Stage 2: video generation on a GPU worker

The second function receives the prepared image and a text prompt, runs the CogVideoX image-to-video pipeline, and returns the resulting video as a base64-encoded string.

Close-up of the @Endpoint decorator code highlighting the GPU specification

run.py
@Endpoint(
    name="cogvideo-worker",
    gpu=[GpuType.NVIDIA_GEFORCE_RTX_5090],
    dependencies=[
        "torch",
        "diffusers",
        "accelerate",
        "transformers>=4.44.0",
        "opencv-python",
        "Pillow",
        "sentencepiece",
        "tiktoken",
        "protobuf",
        "decord",
        "imageio[ffmpeg]",
        "imageio-ffmpeg",
    ]
)
async def generate_multimodal_video(image_b64: str, prompt: str):
    import torch
    import io
    import base64
    import tempfile
    import os
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video
    from PIL import Image

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
    )

    pipe.enable_model_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    image_data = base64.b64decode(image_b64)
    img = Image.open(io.BytesIO(image_data)).convert("RGB")

    video_frames = pipe(
        prompt=prompt,
        image=img,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=50,
        guidance_scale=3.0,
    ).frames[0]

    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        export_to_video(video_frames, tmp.name, fps=8)
        tmp.seek(0)
        video_b64 = base64.b64encode(tmp.read()).decode("utf-8")

    os.remove(tmp.name)
    return {"video_b64": video_b64}

gpu=[GpuType.NVIDIA_GEFORCE_RTX_5090] requests a specific GPU. The memory optimization calls (enable_model_cpu_offload, enable_slicing, enable_tiling) prevent out-of-memory errors during VAE decoding on the 5B parameter model.

Orchestrating the pipeline

The main function calls the two workers in sequence, passing the output of the first directly to the second.

Code for the main function with arrows indicating data flow from the CPU worker to the GPU worker

run.py
async def main():
    url = sys.argv[1] if len(sys.argv) > 1 else input("Enter Image URL: ")
    prompt = sys.argv[2] if len(sys.argv) > 2 else input("Describe the action: ")

    print("--- Launching CogVideoX Multimodal Pipeline ---")
    start = time.time()

    prep = await prepare_adaptive_image(url)
    video_b64 = await generate_multimodal_video(prep["raw_b64"], prompt)

    with open("cogvideo_output.mp4", "wb") as f:
        f.write(base64.b64decode(video_b64["video_b64"]))

    print(f"Success! Video saved in {round(time.time() - start, 2)}s.")


if __name__ == "__main__":
    asyncio.run(main())

The orchestration is two await calls. Flash handles the network communication, data transfer between workers, and hardware provisioning for each stage without any additional configuration.
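The same two-stage data flow can be dry-run locally with stub coroutines. This sketch uses only the standard library; the stubs stand in for the remote calls and only demonstrate how stage-one output feeds stage two:

```python
import asyncio
import base64

async def prepare_stub(url: str) -> dict:
    # Stands in for the CPU worker: returns a base64-encoded payload.
    return {"raw_b64": base64.b64encode(b"fake-jpeg-bytes").decode("utf-8")}

async def generate_stub(image_b64: str, prompt: str) -> dict:
    # Stands in for the GPU worker: consumes stage-one output directly.
    frames = base64.b64decode(image_b64)
    fake_video = frames + prompt.encode("utf-8")
    return {"video_b64": base64.b64encode(fake_video).decode("utf-8")}

async def main() -> dict:
    prep = await prepare_stub("https://example.com/puppy.jpg")
    return await generate_stub(prep["raw_b64"], "a happy puppy jumping")

result = asyncio.run(main())
print(sorted(result))  # → ['video_b64']
```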

Running the pipeline

uv run python run.py "https://images.stockcake.com/public/9/5/d/95df2597-5deb-4175-ae94-ef94d88cc994/adorable-cartoon-puppy-stockcake.jpg" "a happy puppy jumping around happily"

The first run takes several minutes because Flash builds the remote environments and downloads the CogVideoX model weights. Subsequent runs are faster because the worker environments are cached. Progress is streamed to the terminal as Flash queues and executes each stage.

Terminal window streaming live progress logs from the Runpod Flash execution

When complete, cogvideo_output.mp4 appears in the project directory.

Terminal showing the success message alongside the output file and the generated video of a jumping dog

Monitoring endpoints in the dashboard

The Runpod dashboard shows both endpoints under Recent Endpoints, where their status (running, idle, or queued) is visible at a glance. Drilling into an endpoint shows its logs, which is useful for confirming that dependencies installed correctly or diagnosing failures. The Metrics tab provides graphs of request volume, execution time, and cold start counts over time.

Runpod metrics dashboard showing a bar chart of completed, failed, and retried job requests over time

Final thoughts

Flash's value is most apparent in projects that mix hardware requirements across stages. Assigning lightweight tasks to CPU workers and heavy inference to GPU workers is a straightforward cost optimization that would require significant orchestration overhead in a traditional deployment. With Flash, it is a single decorator parameter.

The main tradeoff is cold start latency. The first invocation of a function, especially one with heavy dependencies like torch and diffusers, takes several minutes while the worker environment is built. For latency-sensitive production workloads this matters, but for batch jobs, pipelines, and development iteration it is generally acceptable.

Full documentation and supported hardware options are available on the Runpod Flash documentation site.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.