Runpod Flash: deploying serverless GPU functions from Python
Runpod Flash is a Python SDK that turns decorated functions into serverless GPU endpoints. It removes the need to write a Dockerfile, push to a container registry, or configure a cloud deployment. A function decorated with @Endpoint is automatically packaged, deployed to Runpod's infrastructure, executed on the requested hardware, and torn down when the job completes.
The traditional GPU deployment workflow
Deploying a Python script that requires GPU acceleration in the cloud has historically involved several steps that have nothing to do with the application logic itself.
The typical path starts with writing a Dockerfile that specifies a base OS, installs system dependencies, sets up a Python environment, and installs packages. That image is then built locally, pushed to a container registry such as Docker Hub or Amazon ECR, and finally referenced in a cloud deployment configuration. For Kubernetes, this means writing YAML manifests, managing secrets, and defining scaling policies.
The entire process shifts developer attention from application logic to infrastructure management. It is slow, error-prone, and assumes familiarity with containerization and cloud operations that many ML and data science developers do not have.
How Flash works
Flash abstracts the infrastructure layer entirely.
When a decorated function is called, Flash inspects the function and its declared dependencies, packages the code, sends it to Runpod's serverless platform, provisions a worker with the requested hardware, installs dependencies, runs the function, returns the result, and shuts down the worker. Billing covers only the execution time.
Automatic environment synchronization
Flash handles cross-platform differences automatically. Dependencies declared on macOS, including packages with compiled native components, are resolved, built, and installed for the target Linux GPU environment at execution time, without any manual intervention.
Independent hardware per function
Each decorated function is a distinct endpoint and can be assigned different hardware. A pre-processing function that does not need a GPU can run on a CPU worker at lower cost. A generation function that requires a high-end GPU can specify one explicitly. If multiple requests arrive simultaneously, Flash spins up parallel workers for each, so throughput scales with demand without a separate queueing system.
Each endpoint is also a live callable API, which means GPU functions can be triggered from a web backend, a mobile app, or any other HTTP client without additional setup.
A multi-stage image-to-video pipeline
The following example builds a two-stage pipeline that takes a static image and a text prompt and returns an AI-generated video using the CogVideoX 5B model. Stage one handles image pre-processing on a CPU worker. Stage two handles video generation on a GPU worker. The output of stage one is passed directly as input to stage two.
Project setup
Create a project directory and initialize a virtual environment using uv:
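A minimal sketch of the setup, assuming the SDK is distributed as the `runpod` package (the exact package name for Flash may differ; check the Runpod docs):

```shell
# Create the project directory and a virtual environment with uv
mkdir flash-pipeline && cd flash-pipeline
uv venv                      # creates .venv in the project directory
source .venv/bin/activate

# Install the SDK and the local dependencies (package names assumed)
uv pip install runpod pillow requests
```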
Authenticate with Runpod:
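The exact CLI command is an assumption; consult the Flash documentation for the authoritative name:

```shell
runpod login   # command name assumed; starts the browser-based authorization flow
```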
This prints a URL to open in the browser for authorization. After completing that step, the CLI is authenticated.
Stage 1: image preparation on a CPU worker
The first function downloads an image from a URL, resizes it to the 720x480 format expected by CogVideoX, and returns it as a base64-encoded string. This task needs no GPU, so the decorator specifies a CPU configuration.
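A sketch of the stage-1 worker. The Flash import path, the `CpuType` value, and the function names (`prepare_image`, `resize_and_encode`) are assumptions for illustration; the decorator parameters (`name`, `cpu`, `dependencies`) come from the article. A stub decorator is substituted when the SDK is not installed so the resizing logic can be exercised locally:

```python
import base64
import io

import requests
from PIL import Image

try:
    # Assumed module path; check the Flash docs for the real one
    from runpod.flash import Endpoint, CpuType
except ImportError:
    # Fallback stub so the sketch runs without the SDK installed
    class CpuType:
        GENERAL_PURPOSE = "cpu"

    def Endpoint(**_kwargs):
        def wrap(fn):
            return fn
        return wrap


def resize_and_encode(image_bytes: bytes) -> str:
    """Resize to the 720x480 frame CogVideoX expects; return base64 PNG."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img = img.resize((720, 480))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


@Endpoint(
    name="prepare-image",            # label shown in the Runpod dashboard
    cpu=CpuType.GENERAL_PURPOSE,     # CPU worker, no GPU (value assumed)
    dependencies=["pillow", "requests"],
)
async def prepare_image(image_url: str) -> str:
    resp = requests.get(image_url, timeout=30)
    resp.raise_for_status()
    return resize_and_encode(resp.content)
```

Keeping the resize logic in a plain helper separate from the decorated function makes it easy to test locally before deploying.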
The @Endpoint decorator parameters control how Flash provisions the worker. name sets the label shown in the Runpod dashboard. cpu selects a specific CPU type rather than a GPU. dependencies lists the packages Flash installs on the remote worker before running the function.
Stage 2: video generation on a GPU worker
The second function receives the prepared image and a text prompt, runs the CogVideoX image-to-video pipeline, and returns the resulting video as a base64-encoded string.
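A sketch of the stage-2 worker, assuming the same hypothetical Flash import path and decorator signature as above. The diffusers calls follow the standard CogVideoX image-to-video usage; `GpuType.NVIDIA_GEFORCE_RTX_5090` and the memory optimization calls come from the article:

```python
import base64
import io
import tempfile

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Assumed module path; check the Flash docs for the real one
from runpod.flash import Endpoint, GpuType


@Endpoint(
    name="generate-video",
    gpu=[GpuType.NVIDIA_GEFORCE_RTX_5090],  # specific high-end GPU
    dependencies=["torch", "diffusers", "transformers", "accelerate", "pillow"],
)
async def generate_video(image_b64: str, prompt: str) -> str:
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
    )
    # Memory optimizations that prevent OOM during VAE decoding on the 5B model
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    image = Image.open(io.BytesIO(base64.b64decode(image_b64)))
    frames = pipe(image=image, prompt=prompt).frames[0]

    # Encode the rendered frames as an mp4 and return it as base64
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        export_to_video(frames, f.name, fps=8)
        f.seek(0)
        return base64.b64encode(f.read()).decode("utf-8")
```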
gpu=[GpuType.NVIDIA_GEFORCE_RTX_5090] requests a specific GPU. The memory optimization calls (enable_model_cpu_offload, enable_slicing, enable_tiling) prevent out-of-memory errors during VAE decoding on the 5B parameter model.
Orchestrating the pipeline
The main function calls the two workers in sequence, passing the output of the first directly to the second.
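A sketch of the orchestration step. The module and function names are assumptions; the pattern, per the article, is simply awaiting the two decorated endpoints in sequence:

```python
import asyncio
import base64

# Hypothetical module holding the two decorated stage functions
from pipeline_stages import prepare_image, generate_video


async def main():
    # Stage 1: CPU worker resizes and encodes the source image
    image_b64 = await prepare_image("https://example.com/input.jpg")

    # Stage 2: GPU worker generates the video from the prepared image
    video_b64 = await generate_video(
        image_b64,
        "A gentle breeze moves through the scene",
    )

    with open("cogvideo_output.mp4", "wb") as f:
        f.write(base64.b64decode(video_b64))


if __name__ == "__main__":
    asyncio.run(main())
```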
The orchestration is two await calls. Flash handles the network communication, data transfer between workers, and hardware provisioning for each stage without any additional configuration.
Running the pipeline
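Assuming the orchestration code lives in main.py (filename assumed), the pipeline runs like any other Python script:

```shell
python main.py
```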
The first run takes several minutes because Flash builds the remote environments and downloads the CogVideoX model weights. Subsequent runs are faster because the worker environments are cached. Progress is streamed to the terminal as Flash queues and executes each stage.
When complete, cogvideo_output.mp4 appears in the project directory.
Monitoring endpoints in the dashboard
The Runpod dashboard shows both endpoints under Recent Endpoints, where their status (running, idle, or queued) is visible at a glance. Drilling into an endpoint shows its logs, which is useful for confirming that dependencies installed correctly or diagnosing failures. The Metrics tab provides graphs of request volume, execution time, and cold start counts over time.
Final thoughts
Flash's value is most apparent in projects that mix hardware requirements across stages. Assigning lightweight tasks to CPU workers and heavy inference to GPU workers is a straightforward cost optimization that would require significant orchestration overhead in a traditional deployment. With Flash, it is a single decorator parameter.
The main tradeoff is cold start latency. The first invocation of a function, especially one with heavy dependencies like torch and diffusers, takes several minutes while the worker environment is built. For latency-sensitive production workloads this matters, but for batch jobs, pipelines, and development iteration it is generally acceptable.
Full documentation and supported hardware options are available on the Runpod Flash documentation site.