PyTorch 2.6, CUDA 12.6, TensorRT 10.7, Python 3.12
Torch-TensorRT 2.6.0 targets PyTorch 2.6, TensorRT 10.7, and CUDA 12.6 (builds for CUDA 11.8/12.4 are available via the PyTorch package index: https://download.pytorch.org/whl/cu118 and https://download.pytorch.org/whl/cu124). Python versions 3.9 through 3.12 are supported. Python 3.13 is not supported in this release because TensorRT does not yet support that version of Python.
Deprecation notice
The TorchScript frontend will be deprecated in v2.6. Specifically, the following usage will no longer be supported and will issue a deprecation warning at runtime if used:
:::py
torch_tensorrt.compile(model, ir="torchscript")
Moving forward, we encourage users to transition to one of the supported options:
:::py
torch_tensorrt.compile(model)
torch_tensorrt.compile(model, ir="dynamo")
torch.compile(model, backend="tensorrt")
TorchScript will continue to be supported as a deployment format via post-compilation tracing:
:::py
dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, inputs=[...])
ts_model(...)
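Once traced, the TorchScript module can be saved and reloaded with the standard TorchScript APIs. A minimal sketch, reusing ts_model from above (the file name is illustrative):
:::py
import torch

# Save the traced deployment artifact (hypothetical file name)
torch.jit.save(ts_model, "trt_model.ts")

# Later, reload the artifact for deployment
deployed_model = torch.jit.load("trt_model.ts").cuda()
deployed_model(...)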
Please refer to the README for more information regarding our deprecation policy.
Cross-OS Compilation
In Torch-TensorRT 2.6 it is now possible to use a Linux host to compile Torch-TensorRT programs for Windows using the torch_tensorrt.cross_compile_for_windows API. These programs use a slightly different serialization format to facilitate this workflow and cannot be run on Linux; therefore, torch_tensorrt.cross_compile_for_windows saves the program directly to disk. Developers should then use torch_tensorrt.load_cross_compiled_exported_program on the Windows target to load the serialized program. Torch-TensorRT programs now include target platform information to verify OS compatibility on deserialization, which in turn has caused an ABI bump for the runtime.
:::py
import platform

import torch_tensorrt as torchtrt

if load:
    # Load the saved model on Windows
    if platform.system() != "Windows" or platform.machine() != "AMD64":
        raise ValueError(
            "cross runtime compiled model for windows can only be loaded in Windows system"
        )
    loaded_model = torchtrt.load_cross_compiled_exported_program(save_path).module()
    print(f"model has been successfully loaded from {save_path}")
    # Inference
    trt_output = loaded_model(input)
    print(f"inference result: {trt_output}")
else:
    # Cross-compile the model on Linux
    if platform.system() != "Linux" or platform.architecture()[0] != "64bit":
        raise ValueError(
            "cross runtime compiled model for windows can only be compiled in Linux system"
        )
    compile_spec = {
        "debug": True,
        "min_block_size": 1,
    }
    torchtrt.cross_compile_for_windows(
        model, file_path=save_path, inputs=inputs, **compile_spec
    )
    print(
        f"model has been successfully cross compiled and saved in Linux to {save_path}"
    )
Runtime Weight Streaming
Weight Streaming in Torch-TensorRT is a memory optimization technique that helps deploy large models on memory-constrained devices by dynamically loading weights as needed during inference, reducing the overall memory footprint and enabling more efficient use of hardware resources. It is an opt-in feature that needs to be enabled at both build time and runtime.
:::py
trt_model = torch_tensorrt.dynamo.compile(
    model,
    inputs=input_tensors,
    enabled_precisions={torch.float32},  # only float32 precision is allowed for a strongly typed network
    use_explicit_typing=True,            # create a strongly typed network
    enable_weight_streaming=True,        # enable weight streaming
)
Control the weight streaming budget at runtime using the weight streaming context manager:
:::py
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget

    # Set a 50% weight streaming budget
    requested_budget = int(streamable_budget * 0.5)
    weight_streaming_ctx.device_budget = requested_budget

    trt_model(inputs)
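Setting the budget to the full streamable size keeps all weights resident on the device, which should effectively disable streaming, while smaller budgets trade device memory for inference latency. A minimal sketch using the same context manager:
:::py
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Keep all streamable weights on-device (no streaming at inference time)
    weight_streaming_ctx.device_budget = weight_streaming_ctx.total_device_budget
    trt_model(inputs)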
Intra-Block CUDAGraphs
We updated the CUDAGraphs API to support Intra-Block CUDAGraphs. Previously, when a compiled Torch-TensorRT module had graph breaks, only the TensorRT blocks could be run with CUDAGraphs' optimized kernel launches. With Torch-TensorRT 2.6, the entire graph can be captured and executed as a unified CUDAGraph to minimize kernel launch overhead.
:::py
# Previous API
with torch_tensorrt.runtime.enable_cudagraphs():
    torchtrt_model(inputs)
:::py
# New API
with torch_tensorrt.runtime.enable_cudagraphs(torchtrt_model) as cudagraphs_model:
    cudagraphs_model(inputs)
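Putting the pieces together, a typical flow compiles the module and then wraps it for CUDAGraphs execution. A sketch, where model and inputs are placeholders:
:::py
import torch_tensorrt

trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=inputs)

with torch_tensorrt.runtime.enable_cudagraphs(trt_model) as cudagraphs_model:
    # Calls with same-shaped inputs replay the captured graph,
    # amortizing kernel launch overhead across iterations
    output = cudagraphs_model(inputs)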
Improvements to Engine Caching
First, there are some API changes.
1) make_refittable was renamed to immutable_weights in preparation for a future release that will compile engines with the refit feature enabled by default, allowing the Torch-TensorRT engine cache to provide maximum benefit.
2) refit_identical_engine_weights was added to specify whether to refit the engine with identical weights.
3) strip_engine_weights was added to specify whether to strip the engine weights.
4) The default disk size for engine caching was expanded to 5GB.
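These settings compose at compile time. A sketch of how they might be used together (cache_built_engines, reuse_cached_engines, and engine_cache_size are assumptions drawn from the engine caching tutorial; only the settings listed above are named in this release):
:::py
trt_model = torch_tensorrt.dynamo.compile(
    model,
    inputs=input_tensors,
    immutable_weights=False,              # keep the engine refittable (successor to make_refittable)
    refit_identical_engine_weights=True,  # refit the engine with identical weights
    strip_engine_weights=False,           # keep weights in the serialized engine
    cache_built_engines=True,             # assumed flag: store newly built engines in the cache
    reuse_cached_engines=True,            # assumed flag: reuse engines for isomorphic graphs
    engine_cache_size=5 << 30,            # assumed flag: 5GB cache, matching the new default
)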
In addition, one of the capabilities of engine caching is recognizing whether two graphs are isomorphic: if a new graph is isomorphic to any previously compiled TensorRT engine, the engine cache reuses that engine instead of recompiling the graph, avoiding recompilation time. In the previous release, we used FxGraphCachePickler.get_hash(new_gm) from PyTorch to calculate hash values, which took up a large portion of the total compile time. In this release, we designed a new hash function that computes hash values quickly, determining isomorphism with a ~4x speedup.
C++11 ABI Changes
To keep pace with PyTorch, as of release 2.6 we switched docker images from manylinux to manylinux2_28. In Torch/Torch-TensorRT 2.6, the PRE_CXX11_ABI is used for CUDA 11.8 and 12.4, while the CXX11_ABI is used for CUDA 12.6. For Torch/Torch-TensorRT 2.7, the CXX11_ABI will be used for all of CUDA 11.8, 12.4, and 12.6.
Explicit Typing
We introduce a new compilation setting, use_explicit_typing, to enable mixed precision inference with Torch-TensorRT. When this flag is enabled, TensorRT operates in strong typing mode, ensuring that layer data types are preserved during compilation. For a detailed demonstration of this behavior, refer to the provided tutorial. To learn more about strong typing in TensorRT, refer to the relevant section in the TensorRT Developer Guide.
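For example, a module authored with mixed dtypes keeps those layer dtypes through compilation when strong typing is enabled. A minimal sketch, where the toy model is illustrative:
:::py
import torch
import torch_tensorrt

class MixedPrecision(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 16).half()  # authored in FP16
        self.fc2 = torch.nn.Linear(16, 16)         # authored in FP32

    def forward(self, x):
        return self.fc2(self.fc1(x.half()).float())

model = MixedPrecision().eval().cuda()
inputs = [torch.randn(8, 16, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    arg_inputs=inputs,
    use_explicit_typing=True,  # preserve the layer dtypes authored above
)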
Model Zoo
- We have added a Segment Anything Model 2 (SAM2) compilation example using Torch-TensorRT to our model zoo. The example can be found here.
- We have also added a torch.compile example for GPT2 using the tensorrt backend, as sketched below. This example demonstrates the use of the HuggingFace generate API for auto-regressive decoding. For the export-based workflow (ir="dynamo"), we provide a custom generate function to handle output decoding.
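A minimal sketch of the torch.compile path for GPT2 (assumes the HuggingFace transformers package; the prompt and generation arguments are illustrative):
:::py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().cuda()

# Compile the forward pass with the Torch-TensorRT backend
model.forward = torch.compile(model.forward, backend="tensorrt")

input_ids = tokenizer("The future of AI is", return_tensors="pt").input_ids.cuda()

# The HuggingFace generate API drives auto-regressive decoding
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))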
Multi-GPU Improvements
There are experimental improvements to multi-GPU workflows, including automatically pulling NCCL operations into TensorRT subgraphs. These improvements should be considered alpha stability. More information can be found here: https://github.com/pytorch/TensorRT/tree/main/examples/distributed_inference
What's Changed
- upgrade modelopt by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3160
- feat: exclude refit sensitive ops from TRT compilation by @peri044 in https://github.com/pytorch/TensorRT/pull/3159
- tool: Adding support for the uv system by @narendasan in https://github.com/pytorch/TensorRT/pull/3125
- upgrade torch from 2.5.0.dev to 2.6.0.dev in main branch by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3165
- fix: Fix static arange export by @peri044 in https://github.com/pytorch/TensorRT/pull/3194
- docs: A tutorial on how to overload converters in Torch-TensorRT by @narendasan in https://github.com/pytorch/TensorRT/pull/3197
- Adjust cpp torch trt logging level with compiler option by @keehyuna in https://github.com/pytorch/TensorRT/pull/3181
- extend the timeout-minutes in build/test from 60 min to 120 min by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3203
- extend windows build from 60 min to 120 min by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3218
- fix the global partitioner bug by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3195
- feat: Implement FP32 accumulation for matmul by @peri044 in https://github.com/pytorch/TensorRT/pull/3110
- chore: Make substitute-runner in Windows CI work again by @HolyWu in https://github.com/pytorch/TensorRT/pull/3225
- Run test_base_fp8 for compute capability 8.9 or later by @HolyWu in https://github.com/pytorch/TensorRT/pull/3164
- Fixed batchnorm bug by @cehongwang in https://github.com/pytorch/TensorRT/pull/3170
- Fix for warning as default stream was used in enqueueV3 by @keehyuna in https://github.com/pytorch/TensorRT/pull/3191
- chore: doc updates by @peri044 in https://github.com/pytorch/TensorRT/pull/3238
- chore: Additional Doc fixes by @peri044 in https://github.com/pytorch/TensorRT/pull/3243
- docs: escape dash to avoid collapsing -- to - by @dgcnz in https://github.com/pytorch/TensorRT/pull/3235
- feat: log_softmax decomposition by @HolyWu in https://github.com/pytorch/TensorRT/pull/3137
- fix: change floordiv to divmod for //tests/core/lowering:test_remove_unnecessary_casts by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3223
- Add support for JetPack 6.1 build by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3211
- Require full compilation arg by @apbose in https://github.com/pytorch/TensorRT/pull/3193
- Fix code example in README.md by @juliusgh in https://github.com/pytorch/TensorRT/pull/3253
- chore: Access user settings within the lowering system by @peri044 in https://github.com/pytorch/TensorRT/pull/3245
- fix: expand dim for scalar numpy when freezing tensors to IConstantLayers by @chohk88 in https://github.com/pytorch/TensorRT/pull/3251
- chore: Adapt CIA ops decomposition handling in upsample converters to torch 2.6 by @HolyWu in https://github.com/pytorch/TensorRT/pull/3227
- feat: Support weight streaming by @keehyuna in https://github.com/pytorch/TensorRT/pull/3111
- fix issue 3259 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3260
- skip dummy inference and run_shape_analysis by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3212
- Remove numpy version constraint in test requirements by @HolyWu in https://github.com/pytorch/TensorRT/pull/3264
- fix issue#3269: unwrap tensor shape without opt val by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3279
- disable python 3.13 for linux by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3271
- switch from fx.symbolic_trace to dynamo_trace for converter test part-1 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3261
- cross compile for windows by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3220
- chore: add source_ir in slice layer name by @jiwoong-choi in https://github.com/pytorch/TensorRT/pull/3284
- fix MutableTorchTensorRTModule load issue by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3281
- don't initialize cuda at import time by @technillogue in https://github.com/pytorch/TensorRT/pull/3244
- change decomposition default table due to upstream torch change by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3291
- feat: Support exporting Torch-TRT compiled Graphmodules by @peri044 in https://github.com/pytorch/TensorRT/pull/3262
- Add tensorrt test workflow by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3266
- test future tensorrt version in windows wf by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3290
- fix: get_hash function for engine caching by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3293
- feat: InstanceNorm decomposition by @HolyWu in https://github.com/pytorch/TensorRT/pull/3288
- MODULE.bazel file for NGC docker container by @apbose in https://github.com/pytorch/TensorRT/pull/3156
- docs: Updated tutorial for triton + torch-tensorrt by @narendasan in https://github.com/pytorch/TensorRT/pull/3292
- Fix LayerNorm fp16 precision by @HolyWu in https://github.com/pytorch/TensorRT/pull/3272
- Get decompositions only for CIA ops by @HolyWu in https://github.com/pytorch/TensorRT/pull/3297
- fix: cumsum add_constant bug fix (add dtype for np zeros) by @chohk88 in https://github.com/pytorch/TensorRT/pull/3258
- fix: change docker img from manylinux to manylinux2_28 for all CUDA versions by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3312
- fix: export USE_CXX11_ABI=1 for cuda12.6 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3319
- feat: Support weight-stripped engine and REFIT_IDENTICAL flag by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3167
- fix: Fix additional mem copy of the model during re-export by @peri044 in https://github.com/pytorch/TensorRT/pull/3302
- fix: Fix copying metadata during lowering by @peri044 in https://github.com/pytorch/TensorRT/pull/3320
- upgrade tensorrt dependency to >=10.3.0,<=10.6.0 by @lanluo-nvidia in https://github.com/pytorch/TensorRT/pull/3286
- Use INormalizationLayer for GroupNorm by @HolyWu in https://github.com/pytorch/TensorRT/pull/3273
- Torch TRT ngc container changes by @apbose in https://github.com/pytorch/TensorRT/pull/3299
- feat: Add SAM2 to our model zoo by @peri044 in https://github.com/pytorch/TensorRT/pull/3318
- fix: Remove pytorch overhead while finding fusions for fully convertible models by @peri044 in https://github.com/pytorch/TensorRT/pull/3311
- feat: Automatically generating converters for QDP plugins by @narendasan in https://github.com/pytorch/TensorRT/pull/3321
- Fix failed TestNativeLayerNormConverter by @HolyWu in https://github.com/pytorch/TensorRT/pull/3315
- fix: Correct mutex scope in execute_engine() by @keehyuna in https://github.com/pytorch/TensorRT/pull/3310
- Replace scaled_dot_product_attention lowering pass with decomposition by @HolyWu in https://github.com/pytorch/TensorRT/pull/3296
- chore: example fixes by @peri044 in https://github.com/pytorch/TensorRT/pull/3176
- Remove linear lowering pass and converter by @HolyWu in https://github.com/pytorch/TensorRT/pull/3323
- fix: Fix meta kernel ops import issue for Python only builds by @peri044 in https://github.com/pytorch/TensorRT/pull/3322
- Add test case for ITensor weight in convolution and fix related bug by @chohk88 in https://github.com/pytorch/TensorRT/pull/3327
- feat: Runtime output buffer optimization by @keehyuna in https://github.com/pytorch/TensorRT/pull/3276
- full_like to full decomposition moving to decomposition.py for dynami… by @apbose in https://github.com/pytorch/TensorRT/pull/3289
- Wrapper module around TRT + pytorch subgraphs by @keehyuna in https://github.com/pytorch/TensorRT/pull/3270
- feat: add args for profiling engine caching by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3329
- fix: update bazelisk to fix build errors by @peri044 in https://github.com/pytorch/TensorRT/pull/3328
- chore: adding additional logging to the converter registry system by @narendasan in https://github.com/pytorch/TensorRT/pull/3199
- chore: revert attention decomposition due to flux bug by @peri044 in https://github.com/pytorch/TensorRT/pull/3332
- fix: Fix null inputs case by @peri044 in https://github.com/pytorch/TensorRT/pull/3334
- fix: Record cudagraphs when weight streaming budget has changed by @keehyuna in https://github.com/pytorch/TensorRT/pull/3309
- Cherrypick: Bump TRT version to 10.7 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3341
- Cherrypick: nccl ops multi gpu by @apbose in https://github.com/pytorch/TensorRT/pull/3342
- Update build-test-linux.yml by @narendasan in https://github.com/pytorch/TensorRT/pull/3345
- fix: CI errors on release 2.6 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3358
- fix: CI docker build error for release 2.6 by @zewenli98 in https://github.com/pytorch/TensorRT/pull/3360
- [cherry-pick] trtp for 2.6 release by @narendasan in https://github.com/pytorch/TensorRT/pull/3372
New Contributors
- @dgcnz made their first contribution in https://github.com/pytorch/TensorRT/pull/3235
- @technillogue made their first contribution in https://github.com/pytorch/TensorRT/pull/3244
Full Changelog: https://github.com/pytorch/TensorRT/compare/v2.5.0...v2.6.0