package main

import "testing"

var sum = 0

func BenchmarkNoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum++
	}
}

// go test -bench=.
// BenchmarkNoop-8   873429528   1.36 ns/op
for i := 0; i < 1000000; i++ {
}
0:  cmp i, 1000000
1:  jge 10        ; exit once i reaches the limit
2:  ... code
3:  inc i
4:  jmp 0
10: ... rest of program
i = 0
check:
if i >= 1000000
    goto done
for loop code
...
i++
goto check
done:
rest of program
i++
code
i++
code
i++
code
for i := 0; i < n; i++ { f() }
// tell the Go compiler not to inline the function by marking it with the
// //go:noinline pragma (a "semantic" comment: no space between // and go:)
// (go build -gcflags=-m reports which functions the compiler inlines)
//go:noinline
func add(a, b int) (int, int) {
	return a + b, 1000
}
// go build main.go && go tool objdump -s main.add main
MOVQ $0x0, 0x18(SP)
MOVQ $0x0, 0x20(SP)
MOVQ 0x8(SP), AX
ADDQ 0x10(SP), AX
MOVQ AX, 0x18(SP)
MOVQ $0x3e8, 0x20(SP)
RET
func main() {
	a, b := add(100, 200)
	add(a, b)
}
// go build main.go && go tool objdump -s main.main main
MOVQ $0x64, 0(SP)
MOVQ $0xc8, 0x8(SP)
CALL main.add(SB)
MOVQ 0x10(SP), AX
MOVQ AX, 0x38(SP)
MOVQ 0x18(SP), AX
MOVQ AX, 0x30(SP)
MOVQ 0x38(SP), AX
MOVQ AX, 0x28(SP)
MOVQ 0x30(SP), AX
MOVQ AX, 0x20(SP)
MOVQ 0x28(SP), AX
MOVQ AX, 0(SP)
MOVQ 0x20(SP), AX
MOVQ AX, 0x8(SP)
CALL main.add(SB)
1  MOV addr, r0    copy memory at addr into r0
2  MOV addr, r1    copy memory at addr into r1
3  MOV r0, addr    copy r0 into memory at addr
4  MOV r1, addr    copy r1 into memory at addr
5  MOV r0, $value  store the literal $value in r0
6  MOV r1, $value  store the literal $value in r1
7  ADD             add r0 and r1 and store the result in r0
8  JMP addr        jump to the given address
9  HAL             stop execution
0000: MOV r0, $0
0001: MOV r1, $1
0002: ADD
0003: MOV r0, 10
0004: JMP 1
addr:0 00000101 // mov r0, $0
addr:1 00000000
addr:2 00000110 // mov r1, $1
addr:3 00000001
addr:4 00000111 // add
addr:5 00000011 // mov r0, 10
addr:6 00001010
addr:7 00001000 // jmp 1
addr:8 00000001
addr:9 00000000 // nothing
addr:10 00000000 // result of mov r0, 10
Fetch Stage: The next instruction is fetched from the memory address
that is currently stored in the program counter and stored into the
instruction register. At the end of the fetch operation, the PC points
to the next instruction that will be read at the next cycle.
Decode Stage: During this stage, the encoded instruction present in
the instruction register is interpreted by the decoder.
Execute Stage: The control unit of the CPU passes the decoded
information as a sequence of control signals to the relevant function
units of the CPU to perform the actions required by the instruction,
such as reading values from registers, passing them to the ALU to
perform mathematical or logic functions on them, and writing the
result back to a register. If the ALU is involved, it sends a
condition signal back to the CU. The result generated by the operation
is stored in the main memory or sent to an output device. Based on the
feedback from the ALU, the PC may be updated to a different address
from which the next instruction will be fetched.
Repeat Cycle
L1 cache reference         0.5 ns
Executing an instruction   1 ns
Branch mispredict          5 ns
L2 cache reference         7 ns      14x L1 cache
Mutex lock/unlock          25 ns
Main memory reference      100 ns    20x L2 cache, 200x L1 cache
There are no simple answers. Inline functions might make the code
faster, they might make it slower. They might make the executable
larger, they might make it smaller. They might cause thrashing,
they might prevent thrashing. And they might be, and often are,
totally irrelevant to speed.
inline functions might make it faster:
As shown above, procedural integration might remove a bunch of
unnecessary instructions, which might make things run faster.
inline functions might make it slower:
Too much inlining might cause code bloat, which might cause
"thrashing" on demand-paged virtual-memory systems. In other
words, if the executable size is too big, the system might spend
most of its time going out to disk to fetch the next chunk of
code.
inline functions might make it larger:
This is the notion of code bloat, as described above. For
example, if a system has 100 inline functions each of which
expands to 100 bytes of executable code and is called in 100
places, that's an increase of 1MB. Is that 1MB going to cause
problems? Who knows, but it is possible that that last 1MB could
cause the system to "thrash," and that could slow things down.
inline functions might make it smaller:
The compiler often generates more code to push/pop
registers/parameters than it would by inline-expanding the
function's body. This happens with very small functions, and it
also happens with large functions when the optimizer is able to
remove a lot of redundant code through procedural integration —
that is, when the optimizer is able to make the large function
small.
inline functions might cause thrashing:
Inlining might increase the size of the binary executable, and
that might cause thrashing.
inline functions might prevent thrashing:
The working set size (number of pages that need to be in memory
at once) might go down even if the executable size goes up. When
f() calls g(), the code is often on two distinct pages; when the
compiler procedurally integrates the code of g() into f(), the
code is often on the same page.
inline functions might increase the number of cache misses:
Inlining might cause an inner loop to span across multiple lines
of the memory cache, and that might cause thrashing of the
memory-cache.
inline functions might decrease the number of cache misses:
Inlining usually improves locality of reference within the
binary code, which might decrease the number of cache lines
needed to store the code of an inner loop. This ultimately could
cause a CPU-bound application to run faster.
inline functions might be irrelevant to speed:
Most systems are not CPU-bound. Most systems are I/O-bound,
database-bound or network-bound, meaning the bottleneck in the
system's overall performance is the file system, the database or
the network. Unless your "CPU meter" is pegged at 100%, inline
functions probably won't make your system faster. (Even in
CPU-bound systems, inline will help only when used within the
bottleneck itself, and the bottleneck is typically in only a
small percentage of the code.)
There are no simple answers: You have to play with it to see what
is best. Do not settle for simplistic answers like, "Never use
inline functions" or "Always use inline functions" or "Use inline
functions if and only if the function is less than N lines of
code." These one-size-fits-all rules may be easy to write down,
but they will produce sub-optimal results.
type Database struct {
	writer io.Writer
}

func (d *Database) SetWriter(f io.Writer) error {
	d.writer = f
	return nil
}
....
func (d *Database) writeBlob(b []byte) error {
	checksum := hash(b)
	_, err := d.writer.Write(checksum)
	if err != nil {
		return err
	}
	_, err = d.writer.Write(b)
	return err
}
....
if x == nil
goto panic
... code
return x
panic:
help build a stacktrace
START:
CMPQ 0x10(CX), SP
JBE CALL_PANIC
MOVQ 0x40(SP), AX
TESTQ AX, AX
JLE PANIC
... work work ...
TRACE:
MOVQ CX, 0x50(SP)
MOVQ 0x28(SP), BP
ADDQ $0x30, SP
RET
PANIC:
XORL CX, CX // cx = 0
JMP TRACE
CALL_PANIC:
CALL runtime.morestack_noctxt(SB)
JMP START
package main

type Operation interface {
	Apply() int
}

type Number struct {
	n int
}

func (x Number) Apply() int {
	return x.n
}

type Add struct {
	Operations []Operation
}

func (x Add) Apply() int {
	r := 0
	for _, v := range x.Operations {
		r += v.Apply()
	}
	return r
}

type Sub struct {
	Operations []Operation
}

func (x Sub) Apply() int {
	r := 0
	for _, v := range x.Operations {
		r -= v.Apply()
	}
	return r
}

type AddCustom struct {
	Operations []Number
}

func (x AddCustom) Apply() int {
	r := 0
	for _, v := range x.Operations {
		r += v.Apply()
	}
	return r
}

func main() {
	n := 0
	op := Add{Operations: []Operation{Number{n: 5}, Number{n: 6}}}
	n += op.Apply()
	opc := AddCustom{Operations: []Number{Number{n: 5}, Number{n: 6}}}
	n += opc.Apply()
}
// go build main.go && go tool objdump main
TEXT main.Add.Apply(SB) main.go
MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4526c7
SUBQ $0x30, SP
MOVQ BP, 0x28(SP)
LEAQ 0x28(SP), BP
MOVQ 0x40(SP), AX
TESTQ AX, AX
JLE 0x4526c3
MOVQ 0x38(SP), CX
XORL DX, DX
XORL BX, BX
JMP 0x452678
MOVQ 0x20(SP), SI
ADDQ $0x10, SI
MOVQ AX, DX
MOVQ CX, BX
MOVQ SI, CX
MOVQ CX, 0x20(SP)
MOVQ DX, 0x18(SP)
MOVQ BX, 0x10(SP)
MOVQ 0(CX), AX
MOVQ 0x8(CX), SI
MOVQ 0x18(AX), AX
MOVQ SI, 0(SP)
CALL AX
MOVQ 0x18(SP), AX
INCQ AX
MOVQ 0x10(SP), CX
ADDQ 0x8(SP), CX
MOVQ 0x40(SP), DX
CMPQ DX, AX
JL 0x452666
MOVQ CX, 0x50(SP)
MOVQ 0x28(SP), BP
ADDQ $0x30, SP
RET
XORL CX, CX
JMP 0x4526b4
CALL runtime.morestack_noctxt(SB)
JMP main.Add.Apply(SB)
TEXT main.AddCustom.Apply(SB) main.go
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), CX
XORL DX, DX
XORL BX, BX
JMP 0x4526fa
MOVQ 0(AX)(DX*8), SI
INCQ DX
ADDQ SI, BX
CMPQ CX, DX
JL 0x4526f0
MOVQ BX, 0x20(SP)
RET
goos: linux
goarch: amd64
BenchmarkInterface-8 171836106 6.81 ns/op
BenchmarkInline-8 424364508 2.70 ns/op
BenchmarkNoop-8 898746903 1.36 ns/op
PASS
ok command-line-arguments 4.673s