Greg Lindahl, Founder and CTO of Blekko.com, an alternative search engine, gave a fascinating talk tonight at Silicon Valley Perl Mongers on “Building a Web-Scale Search Engine with Perl.”
“Over the past 5 years we’ve built a web-scale search engine on top of Perl and XS, with 1,500 servers, 20 petabytes of disk and 1/2 petabyte of SSD, a 4 billion webpage crawl and index, and daily code deployments. Along the way we wrote a NoSQL database and a lot of XS, used over 600 CPAN distros, upgraded from 5.8.8 to 5.16, and had a lot of fun. Come hear about the mistakes we made, and the lessons we learned.”
Executive Summary
Greg’s search engine design positions data and computation throughout the data center exactly where they are needed, reflecting his background in supercomputing performance optimization. Blekko’s accomplishment is combining 2,000 curated categories with good clustering to produce relevant search engine result pages (SERPs), something plain keyword matching cannot do. “Samsung ruby wine” is not a programming query, and “skinny mature free” is an adult query; keyword matching gets both of those wrong. This is done with their own Cassandra-like NoSQL database and a million lines of Perl written in a “Higher-Order Perl” functional style, plus 80,000 lines of C.
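To make the keyword-matching problem concrete, here is a toy Perl sketch (purely illustrative, not Blekko’s algorithm; the keyword and category tables are invented) in which a small hand-curated phrase list overrides what a naive keyword classifier would conclude:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy illustration only: curated phrases override naive keyword hits.
    # Both tables below are invented for this example.
    my %keyword_topic = ( ruby => 'programming', perl => 'programming' );
    my %curated_topic = (
        'samsung ruby'       => 'shopping',
        'skinny mature free' => 'adult',
    );

    sub classify {
        my ($query) = @_;
        my $q = lc $query;

        # Curated categories win over keyword matches.
        for my $phrase ( keys %curated_topic ) {
            return $curated_topic{$phrase} if index( $q, $phrase ) >= 0;
        }
        for my $word ( split /\s+/, $q ) {
            return $keyword_topic{$word} if exists $keyword_topic{$word};
        }
        return 'general';
    }

    print classify('samsung ruby wine'),  "\n";   # shopping, not programming
    print classify('skinny mature free'), "\n";   # adult, despite no adult keyword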
Key Perl CPAN Modules
– IO::AIO
– JSON::XS
– AnyEvent
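These three fit together naturally in event-driven crawling code. Here is a minimal, self-contained sketch (not Blekko’s code; the file name and the url field are made up) that lets AnyEvent dispatch IO::AIO’s completion callbacks and uses JSON::XS to decode the result:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use AnyEvent;   # event loop and watchers
    use IO::AIO;    # asynchronous file I/O via a thread pool
    use JSON::XS;   # fast JSON decoding

    # Let the AnyEvent loop dispatch IO::AIO's completion callbacks.
    my $aio_w = AnyEvent->io(
        fh   => IO::AIO::poll_fileno,
        poll => 'r',
        cb   => \&IO::AIO::poll_cb,
    );

    my $done = AnyEvent->condvar;

    # Asynchronously load a (made-up) crawl record, then decode it.
    my $raw;
    aio_load "crawl-record.json", $raw, sub {
        my ($status) = @_;
        if ( $status >= 0 ) {
            my $record = eval { decode_json($raw) };
            print "crawled: $record->{url}\n" if $record;
        }
        $done->send;
    };

    $done->recv;    # run the event loop until the callback fires

The same pattern scales to many outstanding disk requests: IO::AIO queues them in its thread pool while the AnyEvent loop stays responsive.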
Notes
– aside from Blekko, the Microsoft Bing rewrite is the only other new search engine, and it loses $1 billion/year
– Google is so dominant that all of the other search engine companies are allies, including Bing
– Blekko has received $53 million in funding so far
– “10 blue links”: the others copy Google’s SERP layout to the pixel
– the goal is a long runway after launch, unlike their dead competitors (Cuil, Powerset, etc.)
– dataflow framework
– Jenkins framework is so tiny, just replace it if you don’t like it
– 3 copies of data in 3 different clusters on 3 different switches (a toy placement sketch follows these notes)
– colo has an economy-of-scale advantage over the Cloud at around 100 servers
– sold 600 old servers, now 850 servers left
– need 300 crawler nodes and 300 serving nodes
– CPU load is 24×7, varies by 2x, so not elastic enough for Cloud to reduce cost
– more spindles result in better performance
– nodes are 2×160 GB SSD, 10×2 TB disks, and 96 GB RAM
– don’t abuse SSD and it will last a long time
– the first DC had 668 servers, including 500 from HP
– server hardware monoculture is good
– “not that i hate colos …”
– had to write their own database to know its performance tradeoffs and how to fix it
– “we only AB test stuff that’s not important”
– test coverage metrics are a waste of time, considering that, for example, compilers can obscure bugs
– “we use the cowboy development system, but don’t screw up too much or else”; so write tests
– 32 engineers now
– 3 mechanical guys
– eschews Java
– “we’d love to Open Source our infrastructure code”, but there needs to be a community request, e.g. for their Nagios push redesign
– logstash?
– students of search engine design should start with the CommonCrawl data on AWS. It is spammy now, but will improve with relevance information from Blekko in 6 months or so.
– he used the adjective “swipey” to describe iPad-style gestural UIs, possibly related to swipey tabs
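On the “3 copies in 3 clusters on 3 switches” note above, here is a toy Perl sketch of failure-domain-aware placement (an illustration only, not Blekko’s code; the hostnames, cluster and switch labels are invented) that picks replicas so no two copies share a cluster or a switch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy failure-domain-aware placement: choose replicas so that no two
    # copies share a cluster or a switch. Hostnames and labels are invented.
    my @nodes = (
        { host => 'node01', cluster => 'A', switch => 'sw1' },
        { host => 'node02', cluster => 'A', switch => 'sw1' },
        { host => 'node11', cluster => 'B', switch => 'sw2' },
        { host => 'node12', cluster => 'B', switch => 'sw2' },
        { host => 'node21', cluster => 'C', switch => 'sw3' },
    );

    sub place_replicas {
        my ( $count, @candidates ) = @_;
        my ( @chosen, %used_cluster, %used_switch );
        for my $n (@candidates) {
            next if $used_cluster{ $n->{cluster} } or $used_switch{ $n->{switch} };
            push @chosen, $n;
            $used_cluster{ $n->{cluster} } = 1;
            $used_switch{ $n->{switch} }   = 1;
            last if @chosen == $count;
        }
        die "not enough independent failure domains\n" if @chosen < $count;
        return @chosen;
    }

    print "replica on $_->{host} (cluster $_->{cluster}, switch $_->{switch})\n"
        for place_replicas( 3, @nodes );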
It was an excellent talk. One attendee said, “it was the best talk I’ve heard in two years.” There will be an encore at Yet Another Perl Conference (YAPC) next week.
Thanks once again to the Plug & Play Tech Center for hosting the meeting.
Slides
meetup: Silicon Valley Perl
SVLUG: Greg Lindahl on Blekko Search Engine (2010)
Greg Lindahl’s Homepage
The Anatomy of Search Technology: blekko’s NoSQL database