# Optimizing LLM Inference with ONNX Runtime
ONNX Runtime (ORT) is a cross-platform inference engine from Microsoft that runs on CPUs, GPUs, and specialized accelerators through execution providers (DirectML, TensorRT, ROCm). It is a strong choice for environments without NVIDIA GPUs or when cross-platform deployment is required.
## When ONNX Runtime is the Right Choice
- CPU inference: for small and medium models (≤ 7B) without a GPU, ORT is significantly faster than Hugging Face Transformers on CPU thanks to AVX-512 optimizations
- Edge and embedded: ARM devices, Azure IoT, Windows on ARM
- Mixed cloud: deployment across cloud providers with different GPUs
- Compliance: environments without an NVIDIA GPU (AMD, Intel GPU, CPU-only)
- Small encoder models: BERT and RoBERTa for classification and NER, where ORT typically gives a 2-4x speedup
## Converting a model to ONNX
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.exporters.onnx import main_export

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

# Export via Optimum (recommended)
main_export(
    model_name_or_path="cardiffnlp/twitter-roberta-base-sentiment",
    output="./onnx_model/",
    task="text-classification",
    opset=17,
    device="cuda",  # export on a GPU for better optimization
    fp16=True,      # half precision
)
```
```python
# Alternative: torch.onnx.export for custom models
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
dummy_input = {
    "input_ids": torch.ones(1, 128, dtype=torch.long),
    "attention_mask": torch.ones(1, 128, dtype=torch.long),
}
torch.onnx.export(
    model,
    (dummy_input,),  # args must be a tuple; a trailing dict is passed as keyword arguments
    "model.onnx",
    opset_version=17,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "attention_mask": {0: "batch", 1: "seq_len"},
    },
)
```
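After export it is worth verifying numerical parity between the PyTorch and ONNX outputs before deploying. A minimal comparison sketch (the helper names and tolerance are assumptions; FP32 exports usually match to ~1e-4, FP16 needs a looser tolerance):

```python
import numpy as np

def max_abs_diff(a, b) -> float:
    """Largest element-wise deviation between two logit arrays."""
    return float(np.max(np.abs(np.asarray(a, dtype=np.float32) -
                               np.asarray(b, dtype=np.float32))))

def parity_ok(torch_logits, onnx_logits, atol: float = 1e-3) -> bool:
    """True if the exported model reproduces the PyTorch logits within atol."""
    return max_abs_diff(torch_logits, onnx_logits) <= atol
```

Run the same dummy input through `model(**dummy_input).logits` and through an `ort.InferenceSession` on the exported file, then compare the two with `parity_ok`.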
## INT8 Quantization for CPU
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization is the simplest option and requires no calibration data
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,  # for older Intel CPUs
)
```
Result on an Intel Xeon: BERT-base inference goes from 8 ms to 3 ms with <1% accuracy degradation on GLUE.
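To reproduce this kind of before/after comparison on your own hardware, a simple latency harness is enough. A sketch (warmup and iteration counts are assumptions; run it once on the FP32 session and once on the INT8 one with the same tokenized feed):

```python
import time

def bench_ms(session, feed: dict, n_warmup: int = 5, n_iters: int = 50) -> float:
    """Average latency per session.run() call, in milliseconds."""
    for _ in range(n_warmup):        # let ORT finish lazy initialization
        session.run(None, feed)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        session.run(None, feed)
    return (time.perf_counter() - t0) / n_iters * 1000.0
```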
## Static INT8 with calibration
```python
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    quantize_static,
)

class SentimentCalibrationDataReader(CalibrationDataReader):
    def __init__(self, calibration_texts: list[str], tokenizer):
        self.tokenizer = tokenizer
        self.data = iter(calibration_texts)

    def get_next(self) -> dict | None:
        text = next(self.data, None)
        if text is None:
            return None
        inputs = self.tokenizer(text, return_tensors="np", padding="max_length",
                                max_length=128, truncation=True)
        return dict(inputs)

calibration_reader = SentimentCalibrationDataReader(
    calibration_texts=load_calibration_data(),  # 100-500 representative samples
    tokenizer=tokenizer,
)

quantize_static(
    model_input="model.onnx",
    model_output="model_int8_static.onnx",
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ,
    per_channel=True,
)
```
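Since static quantization can shift individual predictions even when aggregate accuracy holds, it is useful to measure how often the INT8 model agrees with the FP32 one on a held-out set. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def agreement_rate(fp32_preds, int8_preds) -> float:
    """Fraction of examples where both models pick the same class."""
    a = np.asarray(fp32_preds)
    b = np.asarray(int8_preds)
    return float(np.mean(a == b))
```

Values below ~0.99 on a representative set are a signal to revisit the calibration data or fall back to dynamic quantization.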
## Optimized inference
```python
import numpy as np
import onnxruntime as ort

# Session configuration
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.intra_op_num_threads = 8  # parallelism within an operator
session_options.inter_op_num_threads = 2  # parallelism across operators
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Providers: order matters (the first available one is used)
providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 ** 3,  # 4 GB
        "cudnn_conv_algo_search": "EXHAUSTIVE",
    }),
    "CPUExecutionProvider",
]

session = ort.InferenceSession(
    "model_int8.onnx",
    sess_options=session_options,
    providers=providers,
)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

LABELS = ["negative", "neutral", "positive"]  # label order of the sentiment model above

def predict_batch(texts: list[str]) -> list[dict]:
    inputs = tokenizer(
        texts, padding=True, truncation=True,
        max_length=128, return_tensors="np",
    )
    outputs = session.run(None, dict(inputs))
    logits = outputs[0]
    probs = softmax(logits, axis=1)
    return [
        {"label": LABELS[np.argmax(p)], "score": float(np.max(p))}
        for p in probs
    ]
```
## ONNX Runtime for LLMs: onnxruntime-genai
For generative models, Microsoft has developed a specialized package:
```python
import onnxruntime_genai as og

# Load a pre-optimized model (Phi-3, Llama, Mistral)
model = og.Model("./phi3-mini-onnx/")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200, temperature=0.7)
params.input_ids = tokenizer.encode("Explain ONNX Runtime")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer.decode([token]), end="", flush=True)
```
## CPU performance
On Intel Xeon Platinum 8375C (32 cores), Phi-3-mini (3.8B), batch=8:
| Configuration | Throughput |
|---|---|
| HF transformers (FP32) | 12 tok/s |
| ORT (FP32) | 28 tok/s |
| ORT (INT8 dynamic) | 67 tok/s |
| ORT (INT4) | 124 tok/s |
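The relative gains against the Hugging Face baseline follow directly from the table (numbers taken from the rows above):

```python
baseline = 12.0  # HF transformers FP32, tok/s
throughput = {"ORT FP32": 28.0, "ORT INT8": 67.0, "ORT INT4": 124.0}
speedup = {name: round(tps / baseline, 1) for name, tps in throughput.items()}
print(speedup)  # INT4 ends up roughly 10x faster than the baseline
```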