The test environment is an NVIDIA A100 40GB, and the scenario is (0, 1, 2) -> (1, 0, 2), with the horizontal coordinates indicating the data shape and data type. The test data covers sizes from 16 MB to 128 MB, in both the fp32 and half data types.

*(Figure: bandwidth and running-time comparison of OneFlow, PyTorch, and a native Copy operation.)*

As we can see from the two graphs above, OneFlow can approximate, or even slightly exceed, the bandwidth of the Copy operation in most cases. In terms of running time compared with PyTorch, OneFlow is at least 1.24 times faster, and up to 1.4 times faster. The bandwidth of Permute is a little higher than that of native Copy because the Copy kernel does no unrolling for instruction-level parallelism, whereas the Permute kernel performs that optimization internally.

Using the two optimization techniques above, OneFlow can easily be faster than PyTorch. However, regular Permute has to handle a wide range of situations, so there may be cases where memory accesses are not merged (coalesced). In some special cases we can merge accesses to improve bandwidth utilization and speed, which brings us to the following discussion of BatchTranspose optimization.

BatchTranspose, i.e. matrix transpose, exchanges only the last two dimensions of the matrix. The following cases all meet the definition of BatchTranspose, where the contents of the brackets indicate the order of the dimensions.

*(Figure: BatchTranspose benchmark results.)*

As we can see from the graph, OneFlow can approach the native Copy operation in most of these cases as well, in terms of both computation time and bandwidth utilization.
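The BatchTranspose pattern described above — all leading batch dimensions stay in place and only the last two dimensions are swapped — can be sketched in NumPy. This is an illustrative sketch of the dispatch condition only, not OneFlow's actual implementation; the function name `is_batch_transpose` is hypothetical:

```python
import numpy as np

def is_batch_transpose(perm):
    """Return True if `perm` only swaps the last two dimensions,
    leaving every leading (batch) dimension in its original place.
    Hypothetical helper for illustration, not OneFlow's real code."""
    n = len(perm)
    if n < 2:
        return False
    return (list(perm[:-2]) == list(range(n - 2))
            and perm[-2] == n - 1
            and perm[-1] == n - 2)

# Permutations that swap only the last two dims qualify:
assert is_batch_transpose((1, 0))          # plain matrix transpose
assert is_batch_transpose((0, 2, 1))       # batched matrix transpose
# Other permutations fall back to the regular Permute path:
assert not is_batch_transpose((1, 0, 2))   # the benchmark scenario above
assert not is_batch_transpose((2, 1, 0))

# A qualifying case can be executed as a batch of 2-D tile transposes:
x = np.arange(24).reshape(2, 3, 4)
y = np.transpose(x, (0, 2, 1))
assert y.shape == (2, 4, 3)
```

In the CUDA kernel this special case pays off because each 2-D slice can be transposed through a shared-memory tile, so both the loads and the stores to global memory stay coalesced.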