LLM Inference Optimization with ONNX Runtime

ONNX Runtime (ORT) is a cross-platform inference engine from Microsoft that runs on CPUs, GPUs, and specialized accelerators through execution providers (DirectML, TensorRT, ROCm). It is a strong choice for environments without NVIDIA GPUs, or whenever cross-platform deployment is required.

When ONNX Runtime is the Right Choice

  • CPU inference: for small and medium models (≤ 7B) without a GPU, ORT is significantly faster than HF transformers on CPU thanks to AVX-512 optimizations
  • Edge and embedded: ARM devices, Azure IoT, Windows on ARM
  • Mixed cloud: deployments across cloud providers with different GPUs
  • Compliance: environments without NVIDIA GPUs (AMD, Intel GPU, CPU-only)
  • Small encoder models: BERT or RoBERTa for classification and NER, where ORT typically gives a 2-4x speedup
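Which execution providers a given ORT build actually exposes is worth checking at startup. A minimal sketch of the usual fallback logic (the helper name and preference order are assumptions, not an official ranking):

```python
def pick_providers(available: list[str]) -> list[str]:
    """Order execution providers by preference; ORT uses the first available one.

    The preference list below is an assumed ranking (fastest first),
    not an official ONNX Runtime recommendation.
    """
    preference = [
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "ROCMExecutionProvider",
        "DmlExecutionProvider",
        "CPUExecutionProvider",
    ]
    return [p for p in preference if p in available]

# In practice: pick_providers(onnxruntime.get_available_providers())
```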

Converting a model to ONNX

from optimum.exporters.onnx import main_export

# Export via Optimum (recommended)
main_export(
    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
    output="./onnx_model/",
    task="text-classification",
    opset=17,
    device="cuda",    # export on a GPU for better graph optimization
    fp16=True         # half precision
)
# Alternative: torch.onnx.export for custom models
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model.eval()

# Dummy inputs; positional order must match the model's forward() signature
dummy_input_ids = torch.ones(1, 128, dtype=torch.long)
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input_ids, dummy_attention_mask),
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "attention_mask": {0: "batch", 1: "seq_len"},
    }
)

INT8 Quantization for CPU

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: the simplest approach, requires no calibration data
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True    # for older Intel CPUs
)

Typical result on an Intel Xeon: BERT-base inference drops from 8 ms to 3 ms, with <1% accuracy degradation on GLUE.
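Latency numbers like these are easy to reproduce with a small timing harness (a generic sketch; pass it a closure over your ORT session and inputs):

```python
import time

def benchmark(fn, warmup: int = 5, iters: int = 50) -> float:
    """Return the average latency of fn() in milliseconds."""
    for _ in range(warmup):       # warm-up: first runs pay allocation/cache costs
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000

# e.g. benchmark(lambda: session.run(None, inputs)) for the FP32 vs INT8 models
```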

Static INT8 with calibration

from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat

class SentimentCalibrationDataReader(CalibrationDataReader):
    def __init__(self, calibration_texts: list[str], tokenizer):
        self.tokenizer = tokenizer
        self.data = iter(calibration_texts)

    def get_next(self) -> dict | None:
        text = next(self.data, None)
        if text is None:
            return None
        inputs = self.tokenizer(text, return_tensors="np", padding="max_length",
                                max_length=128, truncation=True)
        return dict(inputs)

calibration_reader = SentimentCalibrationDataReader(
    calibration_texts=load_calibration_data(),  # 100-500 representative samples
    tokenizer=tokenizer
)

quantize_static(
    model_input="model.onnx",
    model_output="model_int8_static.onnx",
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ,
    per_channel=True
)
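Before shipping a statically quantized model, it is worth comparing its logits against the FP32 reference on a held-out batch. A minimal parity check (hypothetical helper, numpy only):

```python
import numpy as np

def logits_parity(ref: np.ndarray, quant: np.ndarray) -> dict:
    """Compare FP32 reference logits with quantized logits on the same inputs."""
    max_abs = float(np.max(np.abs(ref - quant)))
    # fraction of examples where both models predict the same class
    agreement = float(np.mean(ref.argmax(axis=-1) == quant.argmax(axis=-1)))
    return {"max_abs_diff": max_abs, "label_agreement": agreement}
```

A large max_abs_diff with perfect label agreement is usually tolerable; dropping agreement is the signal to revisit calibration data.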

Optimized inference

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

# Session configuration
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.intra_op_num_threads = 8       # parallelism within an operator
session_options.inter_op_num_threads = 2       # parallelism across operators
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Providers: order matters (the first available one is used)
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 ** 3,  # 4 GB
        "cudnn_conv_algo_search": "EXHAUSTIVE",
    }),
    "CPUExecutionProvider"
]

session = ort.InferenceSession(
    "model_int8.onnx",
    sess_options=session_options,
    providers=providers
)

LABELS = ["negative", "neutral", "positive"]  # label order of the sentiment model

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_batch(texts: list[str]) -> list[dict]:
    inputs = tokenizer(
        texts, padding=True, truncation=True,
        max_length=128, return_tensors="np"
    )
    outputs = session.run(None, dict(inputs))
    logits = outputs[0]
    probs = softmax(logits, axis=1)
    return [
        {"label": LABELS[np.argmax(p)], "score": float(np.max(p))}
        for p in probs
    ]

ONNX Runtime for LLM: onnxruntime-genai

For generative models, Microsoft has developed a specialized package:

import onnxruntime_genai as og

# Load a pre-optimized model (Phi-3, Llama, Mistral)
model = og.Model("./phi3-mini-onnx/")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200, temperature=0.7)
params.input_ids = tokenizer.encode("Explain ONNX Runtime")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer.decode([token]), end="", flush=True)

CPU performance

On Intel Xeon Platinum 8375C (32 cores), Phi-3-mini (3.8B), batch=8:

Configuration            Throughput
HF transformers (FP32)   12 tok/s
ORT (FP32)               28 tok/s
ORT (INT8 dynamic)       67 tok/s
ORT (INT4)               124 tok/s
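As a quick sanity check, the relative speedups over the HF transformers baseline work out as follows:

```python
baseline = 12.0  # HF transformers (FP32), tok/s
throughput = {"ORT FP32": 28.0, "ORT INT8": 67.0, "ORT INT4": 124.0}

# speedup factor vs the baseline, rounded to one decimal
speedups = {name: round(t / baseline, 1) for name, t in throughput.items()}
print(speedups)  # {'ORT FP32': 2.3, 'ORT INT8': 5.6, 'ORT INT4': 10.3}
```

INT4 delivers roughly a 10x gain over unoptimized CPU inference on this hardware; your numbers will vary with core count and sequence length.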