Greg Lindahl, Founder and CTO of Blekko.com, an alternative search engine, gave a fascinating talk tonight at Silicon Valley Perl Mongers on “Building a Web-Scale Search Engine with Perl.”
“Over the past 5 years we’ve built a web-scale search engine on top of Perl and XS, with 1,500 servers, 20 petabytes of disk and 1/2 petabyte of SSD, a 4 billion webpage crawl and index, and daily code deployments. Along the way we wrote a NoSQL database and a lot of XS, used over 600 CPAN distros, upgraded from 5.8.8 to 5.16, and had a lot of fun. Come hear about the mistakes we made, and the lessons we learned.”
Executive Summary
Greg’s search engine design positions data and computation throughout the data center exactly where they are needed, reflecting his background in supercomputing performance optimization. Blekko’s accomplishment is combining 2,000 curated categories with good clustering to produce relevant search engine result pages (SERPs), something plain keyword matching cannot do. “Samsung ruby wine” is not a programming query, and “skinny mature free” is an adult query; keyword matching gets both of those wrong. This is done with their own Cassandra-like NoSQL database and a million lines of Perl written in a “Higher-Order Perl” functional style, plus 80,000 lines of C.
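To make the keyword-matching problem concrete, here is a toy Perl sketch (purely illustrative, not Blekko’s algorithm; the keyword and category tables are invented) in which a small hand-curated phrase list overrides what a naive keyword classifier would conclude:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy illustration only: curated phrases override naive keyword hits.
    # Both tables below are invented for this example.
    my %keyword_topic = ( ruby => 'programming', perl => 'programming' );
    my %curated_topic = (
        'samsung ruby'       => 'shopping',
        'skinny mature free' => 'adult',
    );

    sub classify {
        my ($query) = @_;
        my $q = lc $query;

        # Curated categories win over keyword matches.
        for my $phrase ( keys %curated_topic ) {
            return $curated_topic{$phrase} if index( $q, $phrase ) >= 0;
        }
        for my $word ( split /\s+/, $q ) {
            return $keyword_topic{$word} if exists $keyword_topic{$word};
        }
        return 'general';
    }

    print classify('samsung ruby wine'),  "\n";   # shopping, not programming
    print classify('skinny mature free'), "\n";   # adult, despite no adult keyword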
Key Perl CPAN Modules
– IO::AIO
– JSON::XS
– AnyEvent
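These three fit together naturally in event-driven crawling code. Here is a minimal, self-contained sketch (not Blekko’s code; the file name and the url field are made up) that lets AnyEvent dispatch IO::AIO’s completion callbacks and uses JSON::XS to decode the result:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use AnyEvent;   # event loop and watchers
    use IO::AIO;    # asynchronous file I/O via a thread pool
    use JSON::XS;   # fast JSON decoding

    # Let the AnyEvent loop dispatch IO::AIO's completion callbacks.
    my $aio_w = AnyEvent->io(
        fh   => IO::AIO::poll_fileno,
        poll => 'r',
        cb   => \&IO::AIO::poll_cb,
    );

    my $done = AnyEvent->condvar;

    # Asynchronously load a (made-up) crawl record, then decode it.
    my $raw;
    aio_load "crawl-record.json", $raw, sub {
        my ($status) = @_;
        if ( $status >= 0 ) {
            my $record = eval { decode_json($raw) };
            print "crawled: $record->{url}\n" if $record;
        }
        $done->send;
    };

    $done->recv;    # run the event loop until the callback fires

The same pattern scales to many outstanding disk requests: IO::AIO queues them in its thread pool while the AnyEvent loop stays responsive.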
Notes
– aside from Blekko, the Microsoft Bing rewrite is the only other new search engine, and it loses $1 billion/year
– Google is so dominant that all of the other search engine companies are allies, including Bing
– Blekko has received $53 million in funding so far
– “10 blue links”: the others copy Google’s SERP layout to the pixel
– the goal is a long runway after launch, unlike their dead competitors (Cuil, Powerset, etc.)
– dataflow framework
– Jenkins framework is so tiny, just replace it if you don’t like it
– 3 copies of data in 3 different clusters on 3 different switches (a toy placement sketch follows these notes)
– colo has an economy-of-scale advantage over the Cloud at around 100 servers
– sold 600 old servers, now 850 servers left
– need 300 crawler nodes and 300 serving nodes
– CPU load is 24×7, varies by 2x, so not elastic enough for Cloud to reduce cost
– more spindles result in better performance
– nodes are 2×160 GB SSD, 10×2 TB disks, and 96 GB RAM
– don’t abuse SSD and it will last a long time
– the first DC had 668 servers, including 500 from HP
– server hardware monoculture is good
– “not that i hate colos …”
– had to write their own database to know its performance tradeoffs and how to fix it
– “we only AB test stuff that’s not important”
– test coverage metrics are a waste of time, considering that, for example, compilers can obscure bugs
– “we use the cowboy development system, but don’t screw up too much or else”; so write tests
– 32 engineers now
– 3 mechanical guys
– eschews Java
– “we’d love to Open Source our infrastructure code”, but there needs to be a community request, e.g. for their Nagios push redesign
– logstash?
– students of search engine design should start with the CommonCrawl data on AWS. It is spammy now, but will improve with relevance information from Blekko in 6 months or so.
– he used the adjective “swipey” to describe iPad-style gestural UIs, possibly related to swipey tabs
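On the “3 copies in 3 clusters on 3 switches” note above, here is a toy Perl sketch of failure-domain-aware placement (an illustration only, not Blekko’s code; the hostnames, cluster and switch labels are invented) that picks replicas so no two copies share a cluster or a switch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy failure-domain-aware placement: choose replicas so that no two
    # copies share a cluster or a switch. Hostnames and labels are invented.
    my @nodes = (
        { host => 'node01', cluster => 'A', switch => 'sw1' },
        { host => 'node02', cluster => 'A', switch => 'sw1' },
        { host => 'node11', cluster => 'B', switch => 'sw2' },
        { host => 'node12', cluster => 'B', switch => 'sw2' },
        { host => 'node21', cluster => 'C', switch => 'sw3' },
    );

    sub place_replicas {
        my ( $count, @candidates ) = @_;
        my ( @chosen, %used_cluster, %used_switch );
        for my $n (@candidates) {
            next if $used_cluster{ $n->{cluster} } or $used_switch{ $n->{switch} };
            push @chosen, $n;
            $used_cluster{ $n->{cluster} } = 1;
            $used_switch{ $n->{switch} }   = 1;
            last if @chosen == $count;
        }
        die "not enough independent failure domains\n" if @chosen < $count;
        return @chosen;
    }

    print "replica on $_->{host} (cluster $_->{cluster}, switch $_->{switch})\n"
        for place_replicas( 3, @nodes );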
It was an excellent talk. One attendee said, “it was the best talk I’ve heard in two years.” There will be an encore at Yet Another Perl Conference (YAPC) next week.
Thanks once again to the Plug & Play Tech Center for hosting the meeting.
Slides
meetup: Silicon Valley Perl
SVLUG: Greg Lindahl on Blekko Search Engine (2010)
Greg Lindahl’s Homepage
The Anatomy of Search Technology: blekko’s NoSQL database