Another update to duplicate detection

March 4th, 2008 | by bryan

So after letting the latest experiment run for a while, it’s become apparent that although the new duplicate detection is better than the last setup, we’re still not where we need to be.

Originally, ReadPath used a duplicate detection system based on shingle comparisons of the stories. This system was incredibly effective at finding duplicates. The same system was used for scoring the user preference vectors. However, the user preference scoring was modified a couple of weeks ago to use a different approach, and because the two shared a code base, that change also altered the duplicate detection. It’s become clear that the same code can’t be used for both purposes, so tomorrow I’ll be reverting the duplicate detection code to the shingle method.
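
For anyone unfamiliar with shingle comparison, here is a rough sketch of the general idea in Python. This is not ReadPath's actual code: the function names, the 3-word shingle size, the Jaccard overlap measure, and the 0.8 threshold are all assumptions picked for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text.

    The shingle size k=3 is an assumed default, not a ReadPath setting.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(story_a, story_b, k=3, threshold=0.8):
    """Flag two stories as duplicates when their shingle overlap is high.

    The 0.8 threshold is a hypothetical value that would need tuning.
    """
    return jaccard(shingles(story_a, k), shingles(story_b, k)) >= threshold
```

Two stories that differ only in capitalization or punctuation produce identical shingle sets and score 1.0, while unrelated stories share few or no shingles and score near 0.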

