Feed updates are off temporarily

March 3rd, 2010

I’ve turned off feed updates temporarily so that I can move the spider service over to some new beefier hardware. There were several systems running on the old server and the load on this server was slowing down parts of the site. By moving this system to a new server everything should speed up a bit.

The main issue with moving this system is that ReadPath keeps lists of all of the content items that it’s already seen. There are several million lists, which total 46 GB of files. It’s rather surprising how hard it is to move these types of files around. I did an initial rsync of the directory, which took almost an entire day since it was done while everything was still running and there was lots of random I/O on the disk. Now that I’ve turned updates off I’m doing a final rsync. This rsync is still taking a significant amount of time: even though it’s only been a couple of hours since the initial rsync, a large percentage of the files have been updated with new content.

I’ll post an update as soon as I have the changes deployed and updates turned back on again.

Update:

The Feed Spider has now been turned back on again. You should see all feeds update normally now.

Content update

February 14th, 2010

Just a quick update. The latest dictionary update job for ReadPath just completed.

835,029 feeds monitored.

299,113,314 content items.

28,914,230,869 words for the dictionary. This works out to 123,036,971 distinct two-word pairs.

The entire job ran just over 7 hours on an 8-node Hadoop/HBase cluster. The job read just under 400 GB of data and output 462 GB of data.

Facebook sharing is now live

February 12th, 2010

It’s finally here: you can now post from the stories that you read on ReadPath to Facebook. The permissions on who sees what on these posts are handled on the Facebook end. You can set it so that everyone sees the posts, just friends, or friends of friends.

As always, I’m working to squash any bugs that might pop up, but if you see anything not working the way you think it should, let me know.

Also if you’ve got the time, be sure to become a fan of the ReadPath application on Facebook and let everyone know about it.

And on a further positive note, ReadPath traffic has increased 5x in the last week alone. These little tweaks that I make at night after Caitlin goes to bed are starting to pay off.

Thumbnail Image Issue Fixed

February 12th, 2010

Found the fix for thumbnail images not showing on posts for all browsers other than Firefox. That’ll teach me not to be lazy and test only in Firefox on a Mac. Who knew that anyone used anything else?

Turns out that Firefox is very forgiving with a fat finger typo in some JavaScript. The other browsers, not so much.

Should be fixed now though.

Facebook Integration

February 11th, 2010

Well, it’s only over a year overdue. I think something might have happened in the last year that may have slowed down my development a bit, but Facebook integration is finally on its way. It’s now possible to link your ReadPath account to your Facebook account.

If you’re already logged in to ReadPath you should see a box on the right side of the homepage with a button to “Connect with Facebook”. By clicking this you’ll connect the two accounts. Once you’ve completed this, you can log in with either the Facebook login or your ReadPath login; both will work.

If you don’t already have a ReadPath account you can now register using your Facebook login instead. Simply go to the login page and click the “Login with Facebook” button. This will create a new account on ReadPath that is linked to your Facebook account.

You might ask, why would I want to do this? Outside of making it easier for new users to register for ReadPath, I’ll soon be adding the buttons to share with Facebook. This will allow you to post from ReadPath back to your Facebook wall and discuss items with your Facebook network.

The login / registration flow is a bit tricky. If you do find anything that isn’t working properly, let me know.

Thanks,

Bryan

Java FixedThreadPool

January 20th, 2010

When creating the code to do the scanning of content for potential thumbnail images I needed to work with the FixedThreadPool to get the level of performance that I wanted. A large amount of code within ReadPath extends a class called Scanner. This class creates a Thread and makes it simple to do chunks of work with checkpoints and wait times. For each block of time a process method is called. If the work takes less than the timeout to complete then the thread waits until the timeout has expired before process is called again; otherwise it starts on the next batch immediately. This allows you to rate limit the work being done so that it doesn’t swamp other systems.

The checkpoint allows the thread to be stopped and then know where to start up again. Since code is being pushed out on a regular basis, you need to be able to restart the work being done. The other requirement is that the work needs to be idempotent so that the checkpoints don’t need to be overly fine grained.
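The Scanner class itself isn't shown in this post, so the following is only a rough sketch of the pattern described above: process a batch, sleep for whatever is left of the interval, and advance a checkpoint only after a batch succeeds. The names BatchScanner, processBatch, saveCheckpoint and intervalMillis are made up for illustration, and this version is a Runnable rather than a class that creates its own Thread.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a rate-limited, checkpointed scanner.
 * Not ReadPath's actual Scanner class; all names are illustrative.
 */
abstract class BatchScanner implements Runnable {
    private final long intervalMillis;       // minimum time between batch starts
    private volatile boolean running = true;
    private volatile long checkpoint;        // last position known to be fully processed

    BatchScanner(long intervalMillis, long startCheckpoint) {
        this.intervalMillis = intervalMillis;
        this.checkpoint = startCheckpoint;
    }

    /** Processes one batch starting at 'from' and returns the new checkpoint. Must be idempotent. */
    protected abstract long processBatch(long from) throws Exception;

    /** Persists the checkpoint so a restart can resume from it (a no-op in this sketch). */
    protected void saveCheckpoint(long value) { }

    public void stop() { running = false; }

    public long getCheckpoint() { return checkpoint; }

    @Override
    public void run() {
        while (running) {
            long started = System.nanoTime();
            try {
                checkpoint = processBatch(checkpoint);
                saveCheckpoint(checkpoint);
            } catch (Exception e) {
                // The checkpoint is not advanced, so the same batch is retried.
                // That is only safe because the work is idempotent.
                e.printStackTrace();
            }
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - started);
            long remaining = intervalMillis - elapsed;
            if (remaining > 0 && running) {
                try {
                    Thread.sleep(remaining);   // rate limit: wait out the rest of the interval
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

If a batch overruns the interval, remaining is zero or negative and the loop moves straight on to the next batch, which matches the behavior described above.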

The initial version of this code worked great: it scanned content items and saved the metadata for thumbnail images. One issue, though, is that if the HTML didn’t include the height and width of the image, which I need, then I had to go fetch the image to determine its size. This fetching of external images introduces a significant wait, which slowed down the overall throughput. The answer of course was to make this system multithreaded. A great way to do this, especially because I still wanted control over the throughput, was to use the FixedThreadPool from the java.util.concurrent package.
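The post doesn't say how the image size is determined once the image has been fetched. One way to do it with the standard javax.imageio classes is to ask an ImageReader for the width and height, which avoids decoding the full pixel data. This is a sketch only; the class name ImageDimensions is an assumption, not ReadPath code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical helper: reads an image's dimensions from a stream. */
final class ImageDimensions {
    /** Returns {width, height}, or null if no installed reader understands the data. */
    static int[] read(InputStream in) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(in)) {
            if (iis == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the size is needed
                reader.setInput(iis, true, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }
}
```

The stream would typically come from an HTTP connection to the image URL, and that network wait is what makes the single-threaded version slow.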

My initial pass used some code like:

int threadCount = 10;
ExecutorService pool = Executors.newFixedThreadPool(threadCount);
for(Content content:contentList){
 ImageRunnable ir = new ImageRunnable();
 ir.setContent(content);
 ir.setParent(this);
 pool.execute(ir);
}

ImageRunnable had the process method. Say contentList is a List<Content> with 10,000 content items. This loop would complete almost immediately, with 10,000 ImageRunnables created and added to the thread pool’s queue, which is unbounded for a FixedThreadPool. This would actually work, and the 10 threads would process the work to be done. The problem is that the master Scanner thread has lost track of when all of the submitted work has been completed. It would be very easy for the Scanner to get ahead of itself and keep adding items to the queue until errors start getting thrown due to lack of memory, so all of the benefits of using the Scanner would be lost. What I wanted was for the pool.execute(ir) call to block while all 10 threads are busy. The way I found to do this was to use a Semaphore. The code now looks like:

int threadCount = 10;
ExecutorService pool = Executors.newFixedThreadPool(threadCount);
// One permit per worker thread
Semaphore permits = new Semaphore(threadCount);
for (Content content : contentList) {
  ImageRunnable ir = new ImageRunnable();
  ir.setContent(content);
  ir.setParent(this);
  try {
    // Blocks until a running task has called release()
    permits.acquire();
    pool.execute(ir);
  } catch (InterruptedException e) {
    // Restore the interrupt flag and stop submitting work
    Thread.currentThread().interrupt();
    break;
  }
}

The trick is that the process method then has to call release() when it has completed its work, ideally in a finally block so that a failed task doesn’t hold on to its permit. Now the Scanner thread blocks at permits.acquire() until a slot is open. This does exactly what I want by not allowing Runnables to be submitted to the queue until there is a thread ready to take them. Now the checkpointing and rate limiting work exactly as with a single-threaded Scanner, but it can use multiple threads.
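For anyone who wants to try the pattern outside of ReadPath, here is a self-contained sketch of it. BoundedSubmitter and runAll are made-up names, and plain Runnables stand in for ImageRunnable. Each task is wrapped so that release() runs in a finally block, and the submitting thread reacquires every permit at the end, which tells it when all of the submitted work has finished.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

/** Hypothetical helper showing semaphore-bounded submission to a fixed thread pool. */
final class BoundedSubmitter {
    /** Runs every task, with at most threadCount tasks submitted and unfinished at any time. */
    static void runAll(List<Runnable> tasks, int threadCount) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        Semaphore permits = new Semaphore(threadCount);
        try {
            for (Runnable task : tasks) {
                permits.acquire();               // blocks while every slot is taken
                try {
                    pool.execute(() -> {
                        try {
                            task.run();
                        } finally {
                            permits.release();   // free the slot even if the task throws
                        }
                    });
                } catch (RuntimeException e) {
                    permits.release();           // task was never accepted; return the permit
                    throw e;
                }
            }
            // Holding all permits at once means every submitted task has released its own.
            permits.acquire(threadCount);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because runAll only returns once all permits are back, a caller like the Scanner can safely move its checkpoint forward after each call.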

Subscribe to search results

January 20th, 2010

I’ve now added the ability to subscribe to search results. If you’re a logged-in user, you should see a subscribe button on the search results page and on Category pages. By clicking on this, you will be able to create a new feed that includes items that match the search you just ran.

Initially you’ll get the latest 20 items, but as new items are added you’ll see them in your new feed. This is an easy way to keep track of things that you’re interested in, but from sources that you might not be aware of. Just be careful of adding too broad a term that will produce thousands of results a day. And of course let me know if you’ve got any feedback.

Content snippets now have images

January 20th, 2010

I’ve updated the code to start examining content items to see if there is a primary image associated with the post. If there is an image, the code records the URL and size of that image. This is so that when the content item is displayed in snippet form a thumbnail of the image can be included. The size of the original is required for proper resizing. Currently I’m just using a browser-based thumbnail of the original since I don’t want to have any issues with serving and storing all of these images.
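To show why the original size matters, here is a small sketch of the arithmetic for scaling an image to fit a thumbnail box without distorting it. The class name and the 100 x 100 box are assumptions for illustration; the post doesn't give the dimensions ReadPath actually uses.

```java
/** Hypothetical helper: computes thumbnail dimensions that preserve aspect ratio. */
final class ThumbnailSize {
    /** Scales (width, height) to fit inside maxWidth x maxHeight; never enlarges the image. */
    static int[] fit(int width, int height, int maxWidth, int maxHeight) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        // Use the tighter of the two ratios, capped at 1.0 so small images stay as they are.
        double scale = Math.min(1.0,
                Math.min((double) maxWidth / width, (double) maxHeight / height));
        int w = Math.max(1, (int) Math.round(width * scale));
        int h = Math.max(1, (int) Math.round(height * scale));
        return new int[] { w, h };
    }
}
```

For example, an 800 x 600 original in a 100 x 100 box comes out as 100 x 75, and those two numbers are what would go into the img tag's width and height attributes.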

This thumbnail image is put on the page after page load has completed so that if there are any speed issues with fetching the image it won’t impact the overall load time of the page. This same method has been used for the related items as well.

Chrome issues fixed

January 20th, 2010

There were all sorts of updates that went out last night, one of which was a fix for the order of Folders on the News page in Chrome. All of the other major browsers keep the default order of an associative array in JavaScript (that is, an object’s properties) based on insert order. It appears that the spec doesn’t actually require this, though, and Chrome is the only browser that handles it differently. Here’s a post on Stack Overflow that explains exactly what was happening.

The fix was to do an explicit sort of the data, which tells Chrome exactly what order you want. Google had originally said that this wasn’t a bug and that developers should just correct their code, but it appears that they might be backtracking on that.

Odd issue with favicon.ico

January 16th, 2010

A while ago, I added a feature that would try to add the favicon.ico for a site next to its feed name. Most sites have this image in a common location at http://www.site.com/favicon.ico. It’s also possible to set another location with a tag in the head of the HTML page. Since I didn’t feel that it was critical to have this, just a nice-to-have, I added in the img tags with an onError handler that would hide the image if it wasn’t in the default location.

These images created a bit of an issue though. Several sites would serve a large 404 page with ads and everything if the favicon wasn’t in the default location. The ReadPath page would do the correct thing and hide the image, but behind the scenes it turns out that the browser could do quite a bit of excessive downloading if the 404 pages were too large. Instead of loading a minimal 16 x 16px image it could end up loading a huge 404 page. With 20 stories on the page it’s possible to load up a whole lot of things that will never be displayed and just bog down the browser.

I’ve since changed the behavior so that the favicon images are only loaded after the page itself has finished loading. This should cure the speed issues, but still leaves the possibility of the browser loading a lot of things that aren’t necessary. I thought about caching a copy of the image and keeping a flag on whether the image existed. This would sidestep all of the performance issues, but it creates all sorts of other issues. So for the time being I’ll leave it with the delayed load.