Table 2. Throughput of Native Arithmetic Instructions (Number of Operations per Clock Cycle per Multiprocessor)

| Instruction (by compute capability architecture) | FER | FER | KPL | MAX |
|---|---|---|---|---|
| 32-bit floating-point add, multiply, multiply-add | 32 | 48 | 192 | 128 |
| 64-bit floating-point add, multiply, multiply-add | 16 | 4 | 8 | 1 |
| 32-bit floating-point reciprocal, reciprocal square root, log2f/exp2f/sine/cosine | 4 | 8 | 32 | 32 |
| 32-bit integer add, extended-precision add, subtract, extended-precision subtract | 32 | 48 | 160 | 128 |
| 32-bit integer multiply, multiply-add, extended-precision multiply-add | 16 | 16 | 32 | Multiple instructions |
| 32-bit integer shift | 16 | 16 | 32 | 64 |
| compare, minimum, maximum | 32 | 48 | 160 | 64 |
| 32-bit integer bit reverse, bit field extract/insert | 16 | 16 | 32 | 64 |
| 32-bit bitwise AND / OR / XOR | 32 | 160 | 160 | 128 |
| count of leading zeros, most significant non-sign bit | 16 | 16 | 32 | Multiple instructions |
| population count | 16 | 16 | 32 | 32 |
| warp shuffle | N/A | N/A | 32 | 32 |
| sum of absolute difference | 16 | 16 | 32 | 64 |
| SIMD video instructions vabsdiff2 | N/A | N/A | 160 | Multiple instructions |
| SIMD video instructions vabsdiff4 | N/A | N/A | 160 | Multiple instructions |
| All other SIMD video instructions | 16 | 16 | 32 | Multiple instructions |
| Type conversions from 8/16-bit integer to 32-bit types | 16 | 16 | 128 | 32 |
| Type conversions from and to 64-bit types | 16 | 4 | 8 | 4 |
| All other type conversions | 16 | 16 | 32 | 32 |

Some tips:

* Bit field extract/insert operations are as fast as shift operations: both rows list the same throughput in every column.
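
To put the table's units in perspective, here is a minimal sketch that turns a per-clock, per-multiprocessor figure into a rough peak rate for a whole device. The multiprocessor count and clock frequency are hypothetical values chosen only for illustration, and `peak_ops_per_second` is a helper name introduced here; only the two throughput numbers come from the table (KPL column).

```python
def peak_ops_per_second(ops_per_clock_per_sm, num_sms, clock_hz):
    """Upper bound on operations per second for one instruction class.

    peak = (ops per clock per multiprocessor) * (multiprocessors) * (clock in Hz)
    """
    return ops_per_clock_per_sm * num_sms * clock_hz


# Values taken from the table, KPL column.
FP32_KPL = 192  # 32-bit floating-point add / multiply / multiply-add
FP64_KPL = 8    # 64-bit floating-point add / multiply / multiply-add

# Hypothetical device parameters (assumptions, not from the table).
NUM_SMS = 8
CLOCK_HZ = 1_000_000_000  # 1 GHz

fp32_peak = peak_ops_per_second(FP32_KPL, NUM_SMS, CLOCK_HZ)
fp64_peak = peak_ops_per_second(FP64_KPL, NUM_SMS, CLOCK_HZ)

print(f"32-bit float peak: {fp32_peak / 1e9:.0f} Gops/s")
print(f"64-bit float peak: {fp64_peak / 1e9:.0f} Gops/s")
print(f"32-bit to 64-bit ratio: {fp32_peak // fp64_peak}")
```

The ratio does not depend on the assumed multiprocessor count or clock, since both cancel out. It is simply 192 / 8 from the table. These figures are upper bounds; real kernels are usually limited by memory traffic, occupancy, or instruction mix well before reaching them.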