Sunday, February 6, 2011

interview questions: distributed sorting

If you have a large log that contains URLs and queries, design an algorithm to find the 100 most frequently searched queries.
1. Extract the queries from the log first, count each one in a hashmap (query -> count), then take the 100 with the highest counts.
2. What is the limitation of the hashmap?
It is an in-memory structure, so it may not fit in memory if there are too many distinct queries.
3. What would you do instead?
Add more CPUs: distribute the counting across multiple computers, each handling part of the log.
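4. What if the file is still too large?
Divide the file into chunks by query, say one chunk that only gets queries starting with A. Every occurrence of a query then lands in the same chunk, so each chunk can be counted on its own and the per-chunk top 100 lists merged at the end.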
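
The steps above can be sketched in Python. This is only a rough sketch under some assumptions: the queries are already extracted from the log into a list of strings, and plain in-memory lists stand in for the chunk files or machines. The function names (top_k, partition, top_k_partitioned) are made up for illustration. It also splits by a hash of the query instead of by first letter, which is a swapped-in variant that tends to give more even chunks.

```python
import heapq
import zlib
from collections import Counter


def top_k(counts, k):
    """Return the k (query, count) pairs with the highest counts."""
    # nlargest only tracks k candidates at a time, so this avoids
    # sorting the whole table.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])


def partition(queries, num_parts):
    """Split queries into num_parts chunks by a hash of the query text.

    The same query always hashes to the same chunk, so each chunk holds
    the complete count for every query it contains.
    """
    parts = [[] for _ in range(num_parts)]
    for q in queries:
        # crc32 is deterministic across runs, unlike the built-in hash()
        # for strings.
        index = zlib.crc32(q.encode("utf-8")) % num_parts
        parts[index].append(q)
    return parts


def top_k_partitioned(queries, k, num_parts):
    """Count each chunk separately, then merge the per-chunk winners."""
    candidates = []
    for part in partition(queries, num_parts):
        # Only one chunk's hashmap needs to be in memory at a time.
        candidates.extend(top_k(Counter(part), k))
    # Any query in the global top k is also in the top k of its own
    # chunk, so the merged candidates are enough.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


if __name__ == "__main__":
    sample = ["cats"] * 5 + ["dogs"] * 3 + ["fish"]
    print(top_k_partitioned(sample, 2, 4))
```

In a real setup each chunk would be its own file or its own machine, and only the small per-chunk top 100 lists would need to be sent back and merged.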
