performance - per clock perf. - can I use different registers for same instruction? -

- February 15, 2012

If I use 4 different general-purpose registers (r8, r9, r10, r11), one for each mov instruction, so that the operations are independent, am I right in my impression that the CPU can do all 4 instructions in a single clock cycle?

I want to know because, according to Agner Fog's instruction tables, the reciprocal throughput of the mov instruction is 0.25. I take that to mean the CPU should be able to execute 4 mov operations per cycle. Or have I misinterpreted it?

I am a noob; I have been learning assembly in MASM for 2 months (mainly learning debugging stuff and how registers work, for fun).

Edit: I re-read the question, and you're asking about different registers. I'll leave in my original answer; let's pretend the question wasn't the trivial case. :P

Yes, even without register renaming, these instructions can execute (on separate execution units) in the same cycle, because they're independent of each other.

    mov   eax, 1
    mov   ebx, ecx
    mov   edx, [mem]
    xor   esi, esi     ; xor-zero: doesn't use an execution unit on SnB-family

This is the easiest case for superscalar execution. Even if eax/rax were the destination of all 4 instructions, register renaming would still allow all 4 instructions to execute in parallel.

Out-of-order execution even allows 4 nearby instructions from separate dependency chains to execute at the same time, even if they weren't decoded or issued in the same clock cycle. They won't necessarily retire in the same cycle either, if there are instructions between them. (The x86 ISA guarantees precise exceptions, as do most other ISAs (ARM/PPC/etc.). Current designs accomplish this with in-order retirement. If a memory op segfaults, your program will stop at that instruction, not "well, there was a segfault somewhere recently, can't tell you where". That would be non-precise exceptions.)

Superscalar in-order designs like Atom, or P5 (the original Pentium), can still take advantage of the parallelism in these 4 independent instructions, but not in many other cases.

In a hand-crafted loop, it's common for an SnB-family CPU to be able to sustain over 3 fused-domain uops per cycle. (It's also easy to write loops that run at less than 1 fused-domain uop per cycle, due to latency, to say nothing of cache misses or branch mispredicts.)

Yes, multiple writes to the same architectural register can execute in parallel. Register renaming is not a bottleneck on Intel or AMD designs.

To understand and make full use of Agner Fog's tables, you have to read his microarch guide, or at least his "Optimizing Assembly" guide. See also the other material at the x86 wiki.
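As a quick sanity check on reading the "reciprocal throughput" column: it is the average number of cycles per instruction when many independent instances of that instruction run back to back, so its inverse is the best-case sustained instructions per cycle. A minimal sketch of that arithmetic follows; the helper name is made up for illustration, and 0.25 is the value quoted in the question.

```python
from fractions import Fraction


def instructions_per_cycle(reciprocal_throughput):
    """Best-case sustained rate for a long run of independent instructions.

    reciprocal_throughput is in cycles per instruction, as listed in
    instruction tables. The helper name is hypothetical.
    """
    return 1 / Fraction(reciprocal_throughput)


# Value quoted in the question for mov.
print(instructions_per_cycle("0.25"))  # 4
```

This is an upper bound only. It holds when the instructions are independent of each other and nothing else (latency, cache misses, branch mispredicts) is the bottleneck.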

As Agner Fog's microarch PDF points out (section 9.8, for Intel SnB/IvB):

"Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB), shown in figure 6.1. The μops from the decoders and the stack engine go to the RAT via a queue and then to the ROB-read and the reservation station. The RAT can handle 4 μops per clock cycle. The RAT can rename four registers per clock cycle, and it can even rename the same register four times in one clock cycle."

Read-modify-write is a different story (e.g. the destination of an add instruction). A read-modify-write of an architectural register is (part of) a dependency chain, while an unconditional mov or xor-zeroing starts a new dep chain. (The same goes for the output of other instructions, like lea, that don't read their destination.)
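The difference between the two cases can be sketched with a toy scheduling model. This is not how any real scheduler is implemented: it assumes ideal renaming, unlimited execution units, and the same fixed latency for every instruction, and the function name and the (dest, sources) tuple format are invented for illustration.

```python
def earliest_cycles(instructions, latency=1):
    """Earliest cycle each instruction can start, counting only true
    (read-after-write) dependencies.

    instructions: list of (dest, sources) tuples in program order.
    Simplifying assumptions: ideal renaming, unlimited execution units,
    one fixed latency for everything.
    """
    ready = {}   # architectural register -> cycle its newest value is ready
    starts = []
    for dest, sources in instructions:
        # An instruction waits only for the registers it actually reads.
        start = max((ready.get(src, 0) for src in sources), default=0)
        starts.append(start)
        # Later readers of dest depend on this write, not on older ones.
        ready[dest] = start + latency
    return starts


# Four unconditional writes (like mov eax, imm): nothing reads eax.
movs = [("eax", [])] * 4
# Four read-modify-writes (like add eax, 1): each reads the previous result.
adds = [("eax", ["eax"])] * 4

print(earliest_cycles(movs))  # [0, 0, 0, 0]
print(earliest_cycles(adds))  # [0, 1, 2, 3]
```

In this model the four write-only instructions all start in cycle 0 even though they share a destination, while the four read-modify-write instructions form a chain and start one latency apart.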

A read-modify-write instruction still renames its destination architectural register to a new physical register as well. This is how CPUs handle cases like

    mov eax, 1              ; start of a dep chain
    mov [mem+rax+rcx], eax
    inc eax                 ; eax renamed again

The store needs the value of eax from before the inc. It gets it because, when it checks the RAT, architectural eax is still pointing to the same physical register that mov eax,1 wrote. The inc can't modify that same physical register, because it doesn't know whether everything that needs the previous value of eax is done with it yet.
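The bookkeeping described above can be sketched as a toy register alias table. This is only an illustrative model: the class and method names are invented, eax is treated as simply another name for rax, flags are ignored, and the finite physical register file (and the freeing of registers at retirement) is not modeled.

```python
import itertools


class ToyRAT:
    """Toy register alias table: architectural name -> physical register."""

    def __init__(self):
        self._ids = itertools.count()
        self.table = {}

    def _fresh(self):
        return f"p{next(self._ids)}"

    def read(self, arch_reg):
        # A source operand uses whatever physical register the name
        # currently points to (allocated on first use in this toy).
        if arch_reg not in self.table:
            self.table[arch_reg] = self._fresh()
        return self.table[arch_reg]

    def write(self, arch_reg):
        # Every write gets a fresh physical register; the old one stays
        # intact for instructions that already read the mapping.
        self.table[arch_reg] = self._fresh()
        return self.table[arch_reg]


rat = ToyRAT()
mov_dst = rat.write("rax")                      # mov eax, 1
store_src = (rat.read("rax"), rat.read("rcx"))  # mov [mem+rax+rcx], eax
inc_src = rat.read("rax")                       # inc eax reads the old value...
inc_dst = rat.write("rax")                      # ...and writes a new register

print(mov_dst, store_src, inc_src, inc_dst)     # p0 ('p0', 'p1') p0 p2
```

The store was renamed while rax still mapped to p0, so it keeps reading p0 no matter when the inc executes; the inc's result goes to p2.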
