Introduction

What is WoolyAI?

WoolyAI is a software suite that helps companies simplify GPU management, maximize GPU hardware utilization, and reduce costs. While the analogy is not exact, you can think of WoolyAI as a GPU hypervisor for ML platforms.

  • Dynamic scheduling & allocation: Measures and allocates GPU cores and VRAM at runtime across multiple simultaneous requests. Includes deterministic scheduling options.
  • Memory efficiency: VRAM dedup (e.g., shared base model weights across many LoRA adapters) to pack more models per GPU.
  • Ecosystem-friendly: Works with your existing PyTorch scripts and models, with no code rewrites or porting required (see the sketch after this list).
  • Works with existing pods: No special container images needed.
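
For instance, an ordinary PyTorch training loop like the sketch below would run under WoolyAI without modification. The script contains nothing WoolyAI-specific; runtime allocation of GPU cores and VRAM happens beneath it.

```python
# A plain PyTorch workload -- nothing WoolyAI-specific. Under WoolyAI,
# the same script runs as-is; GPU cores and VRAM are allocated at
# runtime by the WoolyAI Server rather than pinned to this process.
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # ordinary CUDA placement
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randn(32, 4096, device="cuda")
    loss = model(x).square().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```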

The Problem

GPU hardware is expensive and often underutilized. Instead of statically assigning fixed GPU resources upfront, WoolyAI makes allocation decisions in real time based on actual usage, with no coarse-grained time-slicing.

In short: WoolyAI is the "traffic controller" for GPU resources. Rather than giving each user their own dedicated GPU (expensive and wasteful), WoolyAI measures what each workload actually needs and intelligently shares GPU cores and memory across many users in real time, letting teams of 20-50 researchers share a small GPU pool.

Advantages

  • Lower Infrastructure Costs: Maximize utilization per GPU and reduce costs by allowing less GPU hardware to do more.
  • True GPU Concurrency: Runs multiple kernel executions in a single GPU context without time-slicing overhead, unlike traditional static partitioning (MIG/MPS), which creates rigid, underutilized segments.
  • Dynamic Resource Allocation: Real-time redistribution of GPU cores and VRAM based on active kernel processes, priority levels, and actual usage patterns - not fixed quotas.
  • Maximized GPU Utilization: Eliminates idle cycles by continuously analyzing and optimizing resource distribution, ensuring no GPU compute sits unused.
  • Memory Sharing: Deduplicates VRAM across multiple clients to save on expensive GPU memory, sharing identical models in VRAM across multiple workloads (see the sketch after this list).
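
As a concrete illustration of the dedup-friendly workload pattern, here is a hedged client-side sketch assuming the Hugging Face transformers and peft libraries (neither is part of WoolyAI, and the model and adapter names are hypothetical placeholders). Several pods load the same base weights with different LoRA adapters, so the identical base weights can be backed by a single physical copy in VRAM.

```python
# Illustrative client-side pattern (assumes Hugging Face transformers/peft;
# these libraries are not part of WoolyAI). Each pod loads the same base
# model plus its own LoRA adapter. Because the base weights are identical
# across pods, WoolyAI's VRAM dedup can back them with one physical copy.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# "org/base-model" and the adapter repos below are hypothetical placeholders.
base = AutoModelForCausalLM.from_pretrained("org/base-model").cuda()

# Pod A would run:  PeftModel.from_pretrained(base, "team-a/lora-adapter")
# Pod B would run:  PeftModel.from_pretrained(base, "team-b/lora-adapter")
adapted = PeftModel.from_pretrained(base, "team-a/lora-adapter")
```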

Components

WoolyAI is a software platform consisting of two main components that can be deployed and used as a service in your organization, supporting multiple teams and models.

  1. WoolyAI GPU Operator: Deployed on Kubernetes; it lets you deploy NVIDIA pods with the WoolyAI libraries pre-included and configured (see the sketch after this list).

  2. WoolyAI Server: A container that runs on your GPU hosts. It measures and allocates GPU cores and VRAM at runtime across multiple simultaneous pod requests, includes deterministic scheduling options, and deduplicates VRAM across multiple clients to save on expensive GPU memory. Note: the WoolyAI Server is deployed automatically by the WoolyAI GPU Operator.
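
As a rough sketch of how a team might launch a pod on a cluster running the WoolyAI GPU Operator, the example below uses the official kubernetes Python client. The woolyai.com/enabled label is a hypothetical placeholder, not a confirmed WoolyAI selector; consult the operator documentation for the actual labels or annotations it recognizes.

```python
# Sketch: creating a GPU pod on a cluster running the WoolyAI GPU Operator,
# using the official kubernetes Python client. The label below is a
# hypothetical placeholder for whatever selector the operator actually uses.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="train-job",
        labels={"woolyai.com/enabled": "true"},  # hypothetical label
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="pytorch/pytorch:latest",  # standard image; no special
                command=["python", "train.py"],  # WoolyAI image is required
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```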