Name | Modified | Size | Downloads / Week |
---|---|---|---|
CUTLASS 4.1.0 source code.tar.gz | 2025-07-22 | 33.1 MB | |
CUTLASS 4.1.0 source code.zip | 2025-07-22 | 41.7 MB | |
README.md | 2025-07-22 | 2.3 kB | |
Totals: 3 Items | 74.8 MB | 7 |
CuTe DSL
* Add aarch64 support; you can now `pip install nvidia-cutlass-dsl` on GB200 systems!
* More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- Blackwell Mamba2 SSD
- Blackwell SM100 persistent dense blockscaled GEMM with static scheduling
* API updates
- Please refer to FUNCTIONALITY.md for details
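
To confirm that the install worked on a given machine, a small check along these lines may help. This is a minimal sketch, not part of the release: the helper name `cutlass_dsl_status` is hypothetical, and it only uses the Python standard library to report the CPU architecture and the installed version of the `nvidia-cutlass-dsl` distribution (or `None` if it is not installed).

```python
import platform
from importlib import metadata


def cutlass_dsl_status(dist_name="nvidia-cutlass-dsl"):
    """Report the machine architecture and the installed version of dist_name.

    Hypothetical helper for illustration; it does not import or call CuTe DSL.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        version = None
    return {"machine": platform.machine(), "installed_version": version}


status = cutlass_dsl_status()
print(status)
```

On an aarch64 Linux host such as a GB200 system, `machine` is expected to read `aarch64`; `installed_version` stays `None` until the package has been installed.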
CUTLASS C++
* Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- The code now supports empty batch sequences.
* Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
* CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
- Factor out `print_latex` and friends and rewrite.
- Factor out `print_svg` and friends and rewrite.
* Support Blackwell SM100 SIMT packed fp32x2 kernels.
* Support residual add for implicit gemm kernels.
* Various fixes for CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for sm90 that reports invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
- Replace the NotImplementedError with a fallback that packs all nodes into a single topological visitor node.
* Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
* Fix some profiler issues.
- Complete the reference for Blackwell blockwise gemm kernels.
- Fix incorrect regex logic for L1 test.
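
The EVT tracer fix for duplicate edges noted above can be illustrated with a toy graph. This is a minimal sketch of the general technique only: the `Graph` class, its method names, and the node names are hypothetical and are not the CUTLASS Python interface. When the same edge is requested a second time, the sketch routes it through a fresh identity node so that no two nodes are ever joined by more than one edge.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph that never stores parallel edges (hypothetical, not the EVT API)."""

    def __init__(self):
        self.nodes = {}                # node name -> operation label
        self.succ = defaultdict(list)  # node name -> list of successor names
        self._identity_count = 0

    def add_node(self, name, op):
        self.nodes[name] = op

    def add_edge(self, src, dst):
        """Add src -> dst; return the name of an inserted identity node, or None."""
        if dst in self.succ[src]:
            # The edge already exists: insert an identity compute node and
            # connect src -> identity -> dst instead of duplicating the edge.
            ident = f"identity_{self._identity_count}"
            self._identity_count += 1
            self.add_node(ident, "identity")
            self.succ[src].append(ident)
            self.succ[ident].append(dst)
            return ident
        self.succ[src].append(dst)
        return None


# Example: x * x uses the same producer for both operands of the multiply.
g = Graph()
g.add_node("x", "load")
g.add_node("mul", "multiply")
g.add_edge("x", "mul")
inserted = g.add_edge("x", "mul")
print(inserted, dict(g.succ))
```

Because each inserted identity node gets a unique name, repeating the same edge any number of times still leaves at most one edge between any pair of nodes.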