Searched hist:"15 c2d45d" (Results 1 – 3 of 3) sorted by relevance
/openbmc/linux/arch/powerpc/lib/ |
string.S | 15c2d45d Tue Jan 20 19:27:38 CST 2015 Anton Blanchard <anton@samba.org> powerpc: Add 64bit optimised memcmp
I noticed ksm spending quite a lot of time in memcmp on a large KVM box. The current memcmp loop is very unoptimised: byte-at-a-time compares with no loop unrolling. We can do much, much better.
Optimise the loop in a few ways:
- Unroll the byte-at-a-time loop
- For large (at least 32-byte) comparisons that are also 8-byte aligned, use an unrolled, modulo-scheduled loop of 8-byte loads, similar to our glibc memcmp (a C sketch of the idea follows this list).
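The actual implementation is PowerPC assembly in memcmp_64.S; purely as an illustration, the same shape in C (hypothetical helper name, and without the hand unrolling and modulo scheduling of the real loop) might look like:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative C sketch of the technique, not the kernel code:
     * compare 8 bytes at a time while both buffers stay equal, then
     * let a byte loop find the sign of the first difference. */
    static int memcmp_sketch(const void *a, const void *b, size_t n)
    {
        const unsigned char *s1 = a, *s2 = b;

        /* Fast path: large and 8-byte aligned, as in the commit. */
        if (n >= 32 && (((uintptr_t)s1 | (uintptr_t)s2) & 7) == 0) {
            while (n >= 8) {
                uint64_t w1, w2;

                memcpy(&w1, s1, 8);
                memcpy(&w2, s2, 8);
                if (w1 != w2)
                    break;  /* byte loop resolves the ordering */
                s1 += 8;
                s2 += 8;
                n -= 8;
            }
        }

        /* Tail/slow path; the kernel version unrolls this by hand. */
        while (n--) {
            int d = *s1++ - *s2++;
            if (d)
                return d;
        }
        return 0;
    }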
A simple microbenchmark running 10,000,000 iterations of an 8192-byte memcmp was used to measure the performance:
baseline: 29.93 s
modified: 1.70 s
Just over 17x faster.
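The commit doesn't show the harness itself; a hypothetical reconstruction of that measurement in C would be roughly:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical reconstruction of the measurement described above:
     * time 10,000,000 calls of an 8192-byte memcmp on equal buffers
     * (equal buffers force a full-length scan, the worst case). */
    int main(void)
    {
        enum { LEN = 8192, ITERS = 10000000 };
        unsigned char *a = malloc(LEN), *b = malloc(LEN);
        struct timespec t0, t1;
        volatile int sink = 0;

        if (!a || !b)
            return 1;
        memset(a, 0x5a, LEN);
        memset(b, 0x5a, LEN);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += memcmp(a, b, LEN);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.2f s\n", (t1.tv_sec - t0.tv_sec) +
                           (t1.tv_nsec - t0.tv_nsec) / 1e9);
        free(a);
        free(b);
        return sink;  /* keeps the compiler from eliding the loop */
    }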
v2: Incorporated some suggestions from Segher:
- Use andi. instead of rldicl.
- Convert bdnzt eq, to bdnz; the eq condition just duplicated the earlier compare and was a relic of a previous version.
- Don't use cr5; we have plans to use that CR field for fast local atomics.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
memcmp_64.S | 15c2d45d Tue Jan 20 19:27:38 CST 2015 Anton Blanchard <anton@samba.org> powerpc: Add 64bit optimised memcmp
|
Makefile | 15c2d45d Tue Jan 20 19:27:38 CST 2015 Anton Blanchard <anton@samba.org> powerpc: Add 64bit optimised memcmp
|