Table of Contents
- Overview
- Installation
- Core Concepts
- @cuda Decorator
  - Signature and Parameters
  - Behavior
  - Example Usage
- @cuda_advanced Decorator
  - Signature and Parameters
  - Advanced Features
  - Example Usage
- DeviceContext Context Manager
  - Signature and Parameters
  - Auto Tensorization
  - AMP Integration
  - Example Usage
- Utility Functions (utils.py)
  - tensorize_for_universal
  - move_to_torch
  - patch_numpy_with_cupy
- Neural Network Example
- Benchmark and Profiling
- Caveats
- License
Overview
universal-cuda-tools is a Python toolkit that works with PyTorch and optionally TensorFlow. It automatically tensorizes raw Python, NumPy, or CuPy data, manages GPU/CPU device placement, and wraps advanced features such as mixed precision (AMP), timeouts, retries, VRAM checks, memory profiling, a live dashboard, and dry-run mode into a single utility package.
Key Features
- @cuda: Lightweight, device-aware wrapper with auto-tensorization, retry, and fallback
- @cuda_advanced: Fully featured decorator with timeout, AMP, multi-GPU, error callbacks, telemetry, live dashboard, and dry-run
- DeviceContext: Block-scoped control over device, AMP, and tensor conversion
- utils: Includes tensorize_for_universal, move_to_torch, patch_numpy_with_cupy
Installation
```bash
pip install universal-cuda-tools
```
or from source:
```bash
git clone https://github.com/tunahanyrd/universal-cuda-tools.git
cd universal-cuda-tools
python -m build
pip install dist/universal_cuda_tools-<version>.whl
```
Core Concepts
- Device: "cpu", "cuda", "cuda:0", etc.
- Tensorize: Convert raw Python/NumPy data into torch.Tensor
- Auto-tensorize: Automatic conversion using the decorators or the context manager
- OOM Fallback: Fall back to CPU if GPU memory is insufficient
- Mixed Precision (AMP): Use torch.autocast to accelerate with FP16
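In plain PyTorch terms, these concepts look roughly like this (a minimal sketch; the decorators and context manager below automate these steps):

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensorize: raw Python / NumPy data -> torch.Tensor on the chosen device
x = torch.as_tensor(np.array([1.0, 2.0, 3.0]), device=device)

# Mixed precision: run the computation under torch.autocast (FP16 on CUDA)
with torch.autocast(device_type=device.type, enabled=(device.type == "cuda")):
    y = x * 2.0
print(y, y.device)
```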
@cuda Decorator
Signature and Parameters
```python
def cuda(func=None, *,
         device=None,
         verbose=True,
         clear_cache=False,
         retry=0,
         min_free_vram=None,
         auto_tensorize=False,
         to_list=False):
    ...
```
- device (str): 'cuda', 'cuda:0', 'cpu', or None for automatic selection
- verbose (bool): Print logs at INFO level
- clear_cache (bool): Run torch.cuda.empty_cache() before the call
- retry (int): Retry count on error
- min_free_vram (float): Minimum free VRAM in GB; raises RuntimeError if less is available
- auto_tensorize (bool): Convert Python/NumPy inputs to torch.Tensor on the target device
- to_list (bool): Convert tensor output to a native Python list
Behavior
- Optionally converts args to tensors via auto_tensorize
- Moves tensor/NumPy inputs to the target device
- Calls the function; retries on error up to retry times
- If still out of memory, falls back to CPU
- Logs memory usage if verbose is enabled
- Returns output as a list if to_list=True
Example Usage
```python
from cuda_tools import cuda
import numpy as np

@cuda(device="cuda", auto_tensorize=True, to_list=True, verbose=True)
def vector_add(a, b):
    return a + b

result = vector_add([1, 2, 3], np.array([4, 5, 6]))
print(result)  # [5, 7, 9]
```
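The resilience options described above (retry, min_free_vram, clear_cache) combine in the same way; a minimal sketch with illustrative values:

```python
from cuda_tools import cuda
import torch

# Retry once on error, require roughly 1 GB of free VRAM, and clear the CUDA
# cache before the call; per the Behavior notes, a persistent out-of-memory
# error falls back to CPU.
@cuda(device="cuda", retry=1, min_free_vram=1.0, clear_cache=True, verbose=True)
def big_matmul(a, b):
    return a @ b

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)
out = big_matmul(a, b)
print(out.shape, out.device)
```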
@cuda_advanced Decorator
Signature and Parameters
```python
def cuda_advanced(func=None, *,
                  device=None,
                  verbose=True,
                  clear_cache=False,
                  retry=0,
                  min_free_vram=None,
                  auto_tensorize=False,
                  to_list=False,
                  timeout=None,
                  use_amp=False,
                  mgpu=False,
                  error_callback=None,
                  telemetry=False,
                  memory_profiler=True,
                  live_dashboard=False,
                  dry_run=False):
    ...
```
- timeout (float): Timeout in seconds
- use_amp (bool): Enable mixed precision with torch.autocast
- mgpu (bool): Use the least-loaded GPU if available
- error_callback (callable): Called on error
- telemetry (bool): Logs device and timing info
- memory_profiler (bool): Logs memory deltas
- live_dashboard (bool): Tracks call count and total duration
- dry_run (bool): Skips execution, returns None
Advanced Features
- Timeout raises TimeoutError
- AMP speeds up training/inference
- Multi-GPU support
- Error callback and early exit
- Dry-run testing without side effects
Example Usage
```python
from cuda_tools import cuda_advanced
import torch, time

@cuda_advanced(timeout=0.5, retry=1, use_amp=True,
               telemetry=True, verbose=True)
def train_step(x):
    time.sleep(1)
    return x * x

try:
    train_step(torch.ones(10, 10, device="cuda"))
except TimeoutError:
    print("Time out!")
```
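The signature also exposes error_callback and dry_run, which the example above does not touch. A sketch of both, assuming the callback receives the raised exception (the exact callback arguments and re-raise behavior depend on the library):

```python
from cuda_tools import cuda_advanced
import torch

def on_error(exc):
    # Assumption: the decorator passes the raised exception to the callback
    print(f"step failed: {exc!r}")

@cuda_advanced(device="cuda", error_callback=on_error)
def failing_step(x):
    raise RuntimeError("simulated failure")

@cuda_advanced(device="cuda", dry_run=True)
def heavy_step(x):
    return x @ x  # never executed while dry_run=True; the call returns None

try:
    failing_step(torch.ones(4))
except RuntimeError:
    # Depending on the decorator's error handling, the exception may be
    # re-raised here after on_error has run.
    pass

assert heavy_step(torch.ones(8, 8)) is None
```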
DeviceContext Context Manager
Signature and Parameters
```python
class DeviceContext:
    def __init__(self, device='cuda',
                 use_amp=False,
                 verbose=False,
                 auto_tensorize=False):
        ...
    def __enter__(self) -> torch.device: ...
    def __exit__(self, exc_type, exc_val, exc_tb): ...
```
- device (str): 'cuda' or 'cpu'
- use_amp (bool): Enables mixed precision
- verbose (bool): Print logs
- auto_tensorize (bool): Converts Python/NumPy to torch.Tensor inside the block
Usage
```python
from cuda_tools.context import DeviceContext
from cuda_tools.utils import tensorize_for_universal  # assumed path; see "Utility Functions (utils.py)" below
import numpy as np

with DeviceContext(device='cuda', auto_tensorize=True, use_amp=True, verbose=True) as dev:
    a = tensorize_for_universal(5, dev)
    b = tensorize_for_universal(np.array([1, 2, 3]), dev)
    print(a + b, (a + b).device)
```
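Because __enter__ returns a torch.device, the handle from the with-statement also works with ordinary tensor factory functions; a minimal sketch:

```python
import torch
from cuda_tools.context import DeviceContext

with DeviceContext(device='cuda', verbose=True) as dev:
    x = torch.zeros(3, device=dev)       # dev is a plain torch.device
    w = torch.randn(3, 3, device=dev)
    print((w @ x).device)
```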
Utility Functions (utils.py)
tensorize_for_universal(obj, device)
Converts raw Python (int, float), NumPy scalars, arrays, and PyTorch or TensorFlow tensors to a torch.Tensor on the given device
move_to_torch(device, obj)
Moves a NumPy array or PyTorch tensor to the specified device
patch_numpy_with_cupy()
Redirects NumPy calls to CuPy for GPU acceleration
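A short sketch of the three helpers together, assuming they are importable from cuda_tools.utils (matching the utils.py module name above); patch_numpy_with_cupy additionally requires CuPy to be installed:

```python
import numpy as np
import torch
# Assumed import path, based on the "Utility Functions (utils.py)" module name
from cuda_tools.utils import tensorize_for_universal, move_to_torch, patch_numpy_with_cupy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Raw Python scalar and NumPy array -> torch.Tensor on the target device
a = tensorize_for_universal(5, device)
b = tensorize_for_universal(np.array([1, 2, 3]), device)
print(a + b, (a + b).device)

# Move an existing NumPy array (or torch.Tensor) to the specified device
c = move_to_torch(device, np.ones(3))
print(c.device)

# Optional: route NumPy calls through CuPy for GPU acceleration (needs CuPy)
# patch_numpy_with_cupy()
```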
Neural Network Example
```python
import torch
import torch.nn as nn
import torch.optim as optim
from cuda_tools import cuda_advanced
from cuda_tools.context import DeviceContext

class MLP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self, x): return self.net(x)

@cuda_advanced(use_amp=True, retry=1, telemetry=True, verbose=True)
def train_step(model, x, y, loss_fn, opt):
    opt.zero_grad()
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    opt.step()
    return loss.item()

with DeviceContext(device='cuda', auto_tensorize=True):
    model = MLP(20).to('cuda')
    opt = optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(32, 20, device='cuda')
    y = torch.randint(0, 10, (32,), device='cuda')
    loss = train_step(model, x, y, loss_fn, opt)
    print("Loss:", loss)
```
Benchmark and Profiling
```python
import time
import torch
from cuda_tools import cuda

@cuda(device="cpu")
def cpu_op(x): return x * x

@cuda(device="cuda")
def gpu_op(x): return x * x

x_cpu = torch.randn(1000, 1000)
t0 = time.time(); cpu_op(x_cpu); print("CPU:", time.time() - t0)

x_gpu = x_cpu.to('cuda')
t0 = time.time(); gpu_op(x_gpu); print("GPU:", time.time() - t0)
```
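Note that CUDA kernels launch asynchronously, so the GPU timing above may only measure the kernel launch rather than the computation itself. Calling torch.cuda.synchronize() before reading the clock gives a fairer comparison:

```python
import time
import torch

x_gpu = torch.randn(1000, 1000, device="cuda")

torch.cuda.synchronize()               # finish any pending GPU work first
t0 = time.time()
y = x_gpu * x_gpu                      # same operation as gpu_op above
torch.cuda.synchronize()               # wait for the kernel before stopping the clock
print("GPU (synchronized):", time.time() - t0)
```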
Caveats
- Avoid sending tiny scalar ops to the GPU (the launch overhead outweighs the gain)
- If to_list=True, the result is Python-native (it has no .device attribute)
- Setting min_free_vram too high raises RuntimeError (see the sketch after this section)
- To silence TensorFlow warnings:
```python
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
```
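For the min_free_vram caveat, a sketch of catching the failure (the threshold here is deliberately unrealistic, just to trigger the check):

```python
from cuda_tools import cuda
import torch

@cuda(device="cuda", min_free_vram=512.0)   # 512 GB of free VRAM: unrealistic on purpose
def needs_vram(x):
    return x.sum()

try:
    needs_vram(torch.ones(10))
except RuntimeError as err:
    print("Not enough free VRAM:", err)
```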
License
MIT License © 2025
See the LICENSE file for details.