# Optimizing LLM Inference with ONNX Runtime
ONNX Runtime (ORT) is a cross-platform inference engine from Microsoft that runs on CPUs, GPUs, and specialized accelerators through execution providers (DirectML, TensorRT, ROCm). It is a strong choice for environments without NVIDIA GPUs or when cross-platform deployment is required.
## When ONNX Runtime is the Right Choice
- CPU inference: for small and medium models (≤ 7B) without a GPU, ORT is significantly faster than Hugging Face Transformers on CPU thanks to AVX-512 optimizations
- Edge and embedded: ARM devices, Azure IoT, Windows on ARM
- Mixed cloud: deployment across cloud providers with different GPUs
- Compliance: environments without an NVIDIA GPU (AMD, Intel GPU, CPU-only)
- Small encoder models: BERT and RoBERTa for classification and NER, where ORT typically gives a 2-4x speedup
## Converting a model to ONNX
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.exporters.onnx import main_export

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

# Export via Optimum (recommended)
main_export(
    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
    output="./onnx_model/",
    task="text-classification",
    opset=17,
    device="cuda",  # export on a GPU for better optimization
    fp16=True,      # half precision
)
```
```python
# Alternative: torch.onnx.export for custom models
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
dummy_input = {
    "input_ids": torch.ones(1, 128, dtype=torch.long),
    "attention_mask": torch.ones(1, 128, dtype=torch.long),
}
torch.onnx.export(
    model,
    (dummy_input,),  # args must be a tuple; a trailing dict is passed as keyword arguments
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "attention_mask": {0: "batch", 1: "seq_len"},
    },
)
```
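After export it is worth verifying numerical parity between the PyTorch and ONNX outputs before deploying. A minimal comparison sketch (the helper names and tolerance are assumptions; FP32 exports usually match to ~1e-4, FP16 needs a looser tolerance):

```python
import numpy as np

def max_abs_diff(a, b) -> float:
    """Largest element-wise deviation between two logit arrays."""
    return float(np.max(np.abs(np.asarray(a, dtype=np.float32) -
                               np.asarray(b, dtype=np.float32))))

def parity_ok(torch_logits, onnx_logits, atol: float = 1e-3) -> bool:
    """True if the exported model reproduces the PyTorch logits within atol."""
    return max_abs_diff(torch_logits, onnx_logits) <= atol
```

Run the same dummy input through `model(**dummy_input).logits` and through an `ort.InferenceSession` on the exported file, then compare the two with `parity_ok`.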
## INT8 Quantization for CPU
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization is the simplest option and requires no calibration data
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,  # for older Intel CPUs
)
```
Result on an Intel Xeon: BERT-base inference goes from 8 ms to 3 ms with <1% accuracy degradation on GLUE.
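To reproduce this kind of before/after comparison on your own hardware, a simple latency harness is enough. A sketch (warmup and iteration counts are assumptions; run it once on the FP32 session and once on the INT8 one with the same tokenized feed):

```python
import time

def bench_ms(session, feed: dict, n_warmup: int = 5, n_iters: int = 50) -> float:
    """Average latency per session.run() call, in milliseconds."""
    for _ in range(n_warmup):        # let ORT finish lazy initialization
        session.run(None, feed)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        session.run(None, feed)
    return (time.perf_counter() - t0) / n_iters * 1000.0
```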
## Static INT8 with calibration
```python
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)

class SentimentCalibrationDataReader(CalibrationDataReader):
    def __init__(self, calibration_texts: list[str], tokenizer):
        self.tokenizer = tokenizer
        self.data = iter(calibration_texts)

    def get_next(self) -> dict | None:
        text = next(self.data, None)
        if text is None:
            return None
        inputs = self.tokenizer(text, return_tensors="np", padding="max_length",
                                max_length=128, truncation=True)
        return dict(inputs)

calibration_reader = SentimentCalibrationDataReader(
    calibration_texts=load_calibration_data(),  # 100-500 representative samples
    tokenizer=tokenizer,
)

quantize_static(
    model_input="model.onnx",
    model_output="model_int8_static.onnx",
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ,
    per_channel=True,
)
```
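Since static quantization can shift individual predictions even when aggregate accuracy holds, it is useful to measure how often the INT8 model agrees with the FP32 one on a held-out set. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def agreement_rate(fp32_preds, int8_preds) -> float:
    """Fraction of examples where both models pick the same class."""
    a = np.asarray(fp32_preds)
    b = np.asarray(int8_preds)
    return float(np.mean(a == b))
```

Values below ~0.99 on a representative set are a signal to revisit the calibration data or fall back to dynamic quantization.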
## Optimized inference
```python
import numpy as np
import onnxruntime as ort

# Session configuration
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.intra_op_num_threads = 8  # parallelism within an operator
session_options.inter_op_num_threads = 2  # parallelism across operators
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Providers: order matters (the first available one is used)
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 ** 3,  # 4 GB
        "cudnn_conv_algo_search": "EXHAUSTIVE",
    }),
    "CPUExecutionProvider",
]

session = ort.InferenceSession(
    "model_int8.onnx",
    sess_options=session_options,
    providers=providers,
)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

LABELS = ["negative", "neutral", "positive"]  # label order of the sentiment model above

def predict_batch(texts: list[str]) -> list[dict]:
    inputs = tokenizer(
        texts, padding=True, truncation=True,
        max_length=128, return_tensors="np",
    )
    outputs = session.run(None, dict(inputs))
    logits = outputs[0]
    probs = softmax(logits, axis=1)
    return [
        {"label": LABELS[np.argmax(p)], "score": float(np.max(p))}
        for p in probs
    ]
```
## ONNX Runtime for LLMs: onnxruntime-genai
For generative models, Microsoft has developed a specialized package:
```python
import onnxruntime_genai as og

# Load a pre-optimized model (Phi-3, Llama, Mistral)
model = og.Model("./phi3-mini-onnx/")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200, temperature=0.7)
params.input_ids = tokenizer.encode("Explain ONNX Runtime")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer.decode([token]), end="", flush=True)
```
## CPU performance
On Intel Xeon Platinum 8375C (32 cores), Phi-3-mini (3.8B), batch=8:
| Configuration | Throughput |
|---|---|
| HF transformers (FP32) | 12 tok/s |
| ORT (FP32) | 28 tok/s |
| ORT (INT8 dynamic) | 67 tok/s |
| ORT (INT4) | 124 tok/s |
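The relative gains against the Hugging Face baseline follow directly from the table (numbers taken from the rows above):

```python
baseline = 12.0  # HF transformers FP32, tok/s
throughput = {"ORT FP32": 28.0, "ORT INT8": 67.0, "ORT INT4": 124.0}
speedup = {name: round(tps / baseline, 1) for name, tps in throughput.items()}
print(speedup)  # INT4 ends up roughly 10x faster than the baseline
```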