hadoop - Understanding Map-Reduce


So this has confused me. I'm not sure how map-reduce works, and I seem to be lost in the exact chain of events.

My understanding:

  1. The master chunks the files and hands them to the mappers as (k1, v1).
  2. The mappers take the files, perform map(k1, v1) -> (k2, v2), and output this data to individual files.
  3. This is where I'm lost:
    1. How are these individual files combined? What if keys are repeated across files?
    2. Who does the combining? The master? If all the files go to the master at this step, won't that be a massive bottleneck? Are they combined into one file? Are the files re-chunked and handed to the reducers now?
    3. Or, if the files go directly to the reducers instead, what happens to the repeated k3's in the (k3, v3) files at the end of the process? How are they combined? Would there be another map-reduce phase? And if so, would we need to create new operations: map(k3, v3) -> (k4, v4), reduce(k4, v4) -> (k3, v3)?

I think, to sum it up, I don't understand how the files are being re-combined, and that's where my map-reduce logic fails.

Step 3 is called the "shuffle". It's one of the main value-adds of a map-reduce framework, although it's also very expensive for large datasets. The framework performs something akin to a group-by operation on the complete set of records output by the mappers, and the reducers are then called once for each group of records. To answer your individual questions for step 3:
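To make the grouping concrete, here is a minimal sketch in the style of the canonical WordCount, using Hadoop's org.apache.hadoop.mapreduce API. Many map tasks will emit the same k2 (e.g. ("the", 1)); the shuffle guarantees that a single reduce() call sees all of those values together:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (k1 = byte offset, v1 = line of text) -> (k2 = word, v2 = 1).
// Duplicate k2's across mappers are expected; the shuffle handles them.
class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (k2, v2)
        }
    }
}

// Reducer: called once per distinct k2; the Iterable holds ALL values for
// that key, gathered from every mapper by the shuffle.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (k3, v3)
    }
}
```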

3.1. Imagine the job is configured to have r total reducers. The framework carves every one of the map output files into r pieces and sends each piece to one reducer task. With m total mappers, that's m × r little slices flying around. When a particular reducer has received all the slices it needs, it merges them, sorts the result by the k2 key, and groups records on the fly by key for the individual calls to reduce(). If there are duplicate k2 keys, the group will be larger than a singleton; in fact, that is the whole point. If your mappers never output identical keys, your algorithm does not need a reduce phase at all, and you can skip the expensive shuffle altogether.
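The carving into r pieces is just a deterministic partition function applied to each k2, which is why identical keys from different mappers always land at the same reducer. A sketch mirroring Hadoop's default HashPartitioner (the key types match the WordCount sketch above):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A pure function of k2: every mapper independently routes a given key
// to the same reducer slice, with no coordination needed.
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is a valid index in 0..r-1.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```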

3.2. The load of doing all this data movement is spread across the whole cluster, because each reducer task knows which outputs it wants and asks for them from each mapper. The only thing the master node has to do is coordinate: tell each reducer when to start pulling mapper outputs, watch for dead nodes, and keep track of everyone's progress.
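As a toy, single-process illustration of that pull model (plain Java, not Hadoop internals): mapper outputs stay where they were produced, and each reducer fetches only its own slice from every mapper, so no single node ever relays the whole dataset:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PullShuffleDemo {
    public static void main(String[] args) {
        int m = 3, r = 2;
        String[] input = {"apple", "banana", "cherry", "apple"};

        // slices[mapper][reducer] = words that mapper assigned to that reducer
        List<List<List<String>>> slices = new ArrayList<>();
        for (int mapper = 0; mapper < m; mapper++) {
            List<List<String>> perReducer = new ArrayList<>();
            for (int i = 0; i < r; i++) perReducer.add(new ArrayList<>());
            for (String w : input) {
                int dest = (w.hashCode() & Integer.MAX_VALUE) % r;  // partition(k2)
                perReducer.get(dest).add(w);                        // emit (w, 1)
            }
            slices.add(perReducer);
        }

        // Each reducer pulls slice `reducer` from every mapper, then
        // merges, sorts, and groups locally (TreeMap does the last two).
        for (int reducer = 0; reducer < r; reducer++) {
            TreeMap<String, Integer> grouped = new TreeMap<>();
            for (int mapper = 0; mapper < m; mapper++) {
                for (String w : slices.get(mapper).get(reducer)) {
                    grouped.merge(w, 1, Integer::sum);
                }
            }
            System.out.println("reducer " + reducer + " -> " + grouped);
        }
    }
}
```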

3.3. The reducer output is not examined by the framework or combined in any way. However many reducer tasks you have (r), that is how many output files with (k3, v3) records in them you will get. If you need them combined again, run another job on that output.
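If you do need another pass, chaining is just a driver that points a second job's input at the first job's output directory. A sketch using the standard Job API; the paths are placeholder assumptions, and the mapper/reducer classes are the ones sketched above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Chains two jobs: the second job reads the first job's output directory.
public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job first = Job.getInstance(conf, "first pass");
        first.setJarByClass(ChainedJobsDriver.class);
        first.setMapperClass(WordCountMapper.class);     // from the sketch above
        first.setReducerClass(WordCountReducer.class);
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, new Path("/input"));
        FileOutputFormat.setOutputPath(first, new Path("/tmp/intermediate"));
        if (!first.waitForCompletion(true)) System.exit(1);

        Job second = Job.getInstance(conf, "second pass");
        second.setJarByClass(ChainedJobsDriver.class);
        // ...configure the second job's mapper/reducer for the (k3, v3) input...
        FileInputFormat.addInputPath(second, new Path("/tmp/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/output"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```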

