Finding the Most Popular Article on a Site

Just recently I found myself exploring some of the new features released along with version 2.0 of Facebook’s Graph API and checking out the new documentation.  Unfortunately, I found nothing of interest in version 2.0 itself, but I did have my mind blown by a very useful but little-known existing feature of the Graph API.

Without even procuring an access token, you can make calls to the Facebook API, pass in a website for the ‘id’ parameter, and get back some results about that site according to Facebook:

https://graph.facebook.com/v1.0/?id=http://scottlobdell.me

Note that v1.0 is deprecated, and v2.0 requires an access token along with the request, but that’s a fairly trivial hurdle.  Otherwise, by making this request, Facebook will return JSON data that looks something like this:

{
   "id": "http://scottlobdell.me",
   "shares": 5384
}

Note that since this is my own website, I may or may not have modified the response for this blog to make myself look more hip and cool.  You can also make calls with multiple sites:

https://graph.facebook.com/v1.0/?ids=http://xkcd.com,http://pgparks.com

And get back moar data:

{
   "http://xkcd.com": {
      "id": "http://xkcd.com",
      "shares": 376835,
      "comments": 15
   },
   "http://pgparks.com": {
      "id": "http://pgparks.com",
      "shares": 209
   }
}
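If you would rather poke at this endpoint from Python than from a browser, a minimal sketch might look like the following.  It assumes the requests library, and the FACEBOOK_QUERY template here simply mirrors the constant the crawler uses later in this post:

import json
import requests

# Template for the batched Graph API call, same idea as the
# FACEBOOK_QUERY constant referenced by the crawler below
FACEBOOK_QUERY = "https://graph.facebook.com/v1.0/?ids=%s"

def fetch_share_counts(urls):
    # Join the URLs into a single comma-separated 'ids' parameter
    response = requests.get(FACEBOOK_QUERY % ",".join(urls))
    json_data = json.loads(response.text)
    # Not every page has been shared, so default to 0
    return dict((url, data.get("shares", 0)) for url, data in json_data.items())

print fetch_share_counts(["http://xkcd.com", "http://pgparks.com"])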

This very basic yet powerful functionality led me to want to answer a fairly straightforward question that a computer would be useful for: given a base domain, what is the most popular page on that site according to Facebook?  I subsequently made a simple website crawler, and all of the code can be found here on GitHub.

I used Beautiful Soup to quickly and easily scrape and parse HTML to extract links, and I used gevent to make asynchronous network calls. Combined with the Facebook API examples above, I could crawl a domain, look up the share count for every single page on that domain, and find the most popular thing ever posted on that site by ranking the results from most to least popular.

We can step through all of the code below:

import sys

from gevent import monkey

if __name__ == "__main__":
    # Patch blocking network calls before the crawl starts
    monkey.patch_socket()
    monkey.patch_ssl()
    if len(sys.argv) > 1:
        url = sys.argv[1]
        domain_crawler = FacebookContentFinder(url)
        print domain_crawler.crawl()
    else:
        print "Pass in a website as the second argument"

Starting at the top of the script, we first invoke the totally unobtrusive power of gevent and monkey patch core network functionality so that network calls become non-blocking.  Without calling patch_socket() and patch_ssl(), the network calls would remain serialized.
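As a small aside, here is a self-contained sketch (not part of the crawler, with placeholder URLs) of what the patching buys us: once the socket module is patched, plain requests calls made from separate greenlets overlap instead of running one after another.

import gevent
from gevent import monkey
monkey.patch_socket()
monkey.patch_ssl()

import requests

def fetch(url):
    # With the patched socket, this call yields to other greenlets
    # while it waits on the network instead of blocking the process
    return url, requests.get(url).status_code

urls = ["http://xkcd.com", "http://scottlobdell.me", "http://pgparks.com"]
greenlets = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(greenlets)
for greenlet in greenlets:
    print greenlet.value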

The rest of this portion of the script is fairly straightforward: force the end user to pass in a URL on the command line and then instantiate the FacebookContentFinder class defined below:

class FacebookContentFinder(object):
    def __init__(self, base_domain):
        self.base_domain = base_domain.strip("/")  # starting point of the crawl
        self.all_links = set()                     # every URL we have visited
        self.min_str_length, self.max_str_length = self._establish_min_max_ignorables()
        self.pool = gevent.pool.Pool(POOL_SIZE)    # caps concurrent greenlets
        self.greenlets = []                        # spawned greenlets, joined later

There are a few things we need to keep track of throughout the process.  The base domain establishes the starting point of the crawl.  Since we parse every link on every page, it’s quite possible to create infinite loops, so we track visited pages with a Python set.  The min_str_length and max_str_length values are beyond the scope of basic functionality and are used as helpers to ignore non-HTML files; see the full file for additional context.

The pool and greenlets attributes reference gevent-specific logic.  By using a gevent pool, we can restrict the total number of greenlets (or “emulated threads”, if you will (and I will)) to a certain size.  We maintain a list of the greenlets that we spawn so that we can eventually block until all of these “threads” (if you will) have completed.
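If the pool-plus-joinall pattern is new to you, here is a tiny standalone sketch of the idea; the work function and the POOL_SIZE value are just placeholders:

import gevent
import gevent.pool

POOL_SIZE = 20

def work(n):
    gevent.sleep(0.1)  # stand-in for a network call
    return n * n

pool = gevent.pool.Pool(POOL_SIZE)  # at most POOL_SIZE greenlets run at once
greenlets = [pool.spawn(work, n) for n in range(100)]
gevent.joinall(greenlets)           # block until every greenlet has finished
print [g.value for g in greenlets]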

Now we can kick off the crawl for the object:

    def crawl(self):
        # Seed the crawl with the base domain, wait for the spawned
        # greenlets to finish, then rank the collected URLs by shares
        self.greenlets.append(self.pool.spawn(self.find_links_in_url, self.base_domain))
        gevent.joinall(self.greenlets)
        return self.get_facebook_scores()

find_links_in_url is the method defined below, and base_domain is the single argument we pass to that method.  This particular method is recursive in nature and will call itself again for every single link on every single page.  The logic is defined here:

    def find_links_in_url(self, url):
        if url is None:
            return
        url = self._reformat_url(url)
        if self._should_ignore(url):
            return
        self.all_links.add(url)
        print "Found %s" % url
        try:
            response = requests.get(url)
        except requests.exceptions.SSLError:
            print "SSL error on %s, skipping" % url
            return
        except requests.exceptions.ConnectionError:
            print "Connection error on %s, skipping" % url
            return
        data = response.text
        soup = BeautifulSoup(data)
        for link in soup.find_all('a'):
            linked_url = link.get('href')
            self.greenlets.append(self.pool.spawn(self.find_links_in_url, linked_url))

Note that this method relies on a few helper functions.  You can see the original file for full context, but hopefully the methods, named in plain English, are relatively self-explanatory.  For additional context on what those words mean, please see a dictionary.
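For the curious, here is a rough guess at what those helpers could look like, based only on how they are called above.  This is illustrative, not the actual code from the repository; in particular, the real version leans on the min_str_length and max_str_length bookkeeping mentioned earlier rather than a hard-coded extension list:

    def _reformat_url(self, url):
        # Hypothetical sketch: turn relative links into absolute ones
        # and strip trailing slashes so duplicates compare equal
        if url.startswith("/"):
            url = self.base_domain + url
        return url.strip("/")

    def _should_ignore(self, url):
        # Hypothetical sketch: skip anything already visited,
        # anything off-domain, and anything that isn't an HTML page
        if url in self.all_links:
            return True
        if not url.startswith(self.base_domain):
            return True
        if url.endswith((".jpg", ".png", ".gif", ".pdf", ".zip")):
            return True
        return False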

The important part is that we gather all of the links on each page and recursively call this method again to do the same thing on the next page.  Rather than calling the method directly, we spawn another greenlet in the gevent pool, and it gets executed asynchronously.

Finally, we can block until the entire site has been crawled.  If your eyes haven’t glazed over, you’ll remember that we maintained a set of every URL we visited.  We can now pass those URLs to Facebook to find out which ones are popular, and we can find our proverbial prom queen:

    def get_facebook_scores(self):
        site_to_shares = {}
        all_links = list(self.all_links)
        # chunks() is a helper from the full source; it yields slices of
        # at most 50 URLs so each Facebook call stays a reasonable size
        for list_of_links in chunks(all_links, len(all_links), 50):
            all_urls_str = ",".join(list_of_links)
            print "Making a call to Facebook..."
            response = requests.get(FACEBOOK_QUERY % all_urls_str)
            json_data = json.loads(response.text)
            for url in json_data.keys():
                site_data = json_data[url]
                shares = site_data.get("shares")
                if shares:
                    site_to_shares[url] = shares
        # Sort descending by share count so the most popular page comes first
        site_to_shares = OrderedDict(sorted(site_to_shares.items(), key=lambda t: t[1], reverse=True))
        return site_to_shares

Here, I’m simply batching all of the URLs into chunks of 50 and making a call to the Facebook Graph API for each batch.  I take the data returned from Facebook and add it to a dictionary mapping each URL to its number of shares on Facebook.  Finally, I return that data as an OrderedDict whose keys are in sorted order, so the most popular page is listed first.
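The chunks() helper lives in the full source rather than in the snippet above; a plausible version, written to match the three-argument call used here, would look something like this:

def chunks(items, length, size):
    # Yield successive slices of at most `size` items from the first
    # `length` elements of the list (here, 50 URLs per Facebook call)
    for start in xrange(0, length, size):
        yield items[start:start + size]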

Again, please see the full Python file to see the whole thing in action, and feel free to give it a whirl.