Hadoop reduce method reuses objects

December 28th, 2009 | by bryan

Over the holiday break I was playing around with creating a Map/Reduce job that would scan through all of the content items and build a link graph. It was a fairly straightforward job. I would scan each content item for all of its hrefs and, for each one, emit a record containing the hash of the URL, the contentId of the item that linked to it, and whether that content item was the owner of the URL. To encapsulate these values I created a new LinkRecord Writable, keyed off of the hash of the URL, which was fairly straightforward to do by just implementing Writable.
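
For reference, here is roughly what such a LinkRecord might look like; the field names and types below, and the copy constructor (which becomes useful later), are illustrative assumptions rather than the exact original code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class LinkRecord implements Writable {

    private long contentId;      // the content item containing the link
    private boolean owner;       // does this content item claim to own the URL?
    private long publishedTime;  // used later to pick the oldest owning item

    public LinkRecord() {}       // Hadoop needs the no-arg constructor

    // Copy constructor: snapshots the fields of another (possibly reused) instance.
    public LinkRecord(LinkRecord other) {
        this.contentId = other.contentId;
        this.owner = other.owner;
        this.publishedTime = other.publishedTime;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(contentId);
        out.writeBoolean(owner);
        out.writeLong(publishedTime);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        contentId = in.readLong();
        owner = in.readBoolean();
        publishedTime = in.readLong();
    }
}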

Then, in the reduce method, I collected all of the LinkRecords for a given URL. I needed to scan the list of LinkRecords several times, because I first had to find the oldest content item that claimed to be the owner of the URL; once I had that, I could distinguish items from the same content feed from items from other feeds. To buffer the records I used a bit of code like:

List<LinkRecord> list = new ArrayList<LinkRecord>();

for (LinkRecord record : values) {
    list.add(record);
}

Then I would iterate through the list and do all of the necessary work. This all seemed to run as expected; however, when I looked at the results, each link had n copies of the same value, and different links had different numbers of copies.

What appeared to be going on is that the Iterable values was reusing the single object it exposed in the loop, simply overwriting that object's fields on each iteration. So my list ended up holding n references to the same object.
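
For anyone who hasn't run into this before, values here is the Iterable that the new mapreduce API hands to the reducer. The skeleton below (with an assumed LongWritable key for the URL hash) shows where the loop lives; Hadoop deserializes each successive value into that one shared LinkRecord instance as you iterate:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Key and output types here are illustrative; the point is that the framework
// reuses a single LinkRecord instance across the whole Iterable.
public class LinkGraphReducer extends Reducer<LongWritable, LinkRecord, LongWritable, LinkRecord> {

    @Override
    protected void reduce(LongWritable urlHash, Iterable<LinkRecord> values, Context context)
            throws IOException, InterruptedException {
        // The buffering loop shown above goes here; each pass through the loop
        // sees the same LinkRecord object with its fields overwritten by readFields().
    }
}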

To solve this, instead of adding the record straight to my list inside the for loop, I created a new LinkRecord object, copied the fields from the loop object into it, and then added that new LinkRecord to the list. This allowed my code to function as expected.
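
Concretely, the buffering loop became something like this (using the illustrative copy constructor from the sketch above; copying field by field, or WritableUtils.clone if your Hadoop version has it, would work just as well):

List<LinkRecord> list = new ArrayList<LinkRecord>();

for (LinkRecord record : values) {
    // Take a private snapshot of the reused object's current state before buffering it.
    list.add(new LinkRecord(record));
}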
