Interview Questions

There is a large document which contains millions of words.

Microsoft Interview Questions and Answers



30. There is a large document which contains millions of words.

Question:
There is a large document which contains millions of words. How would you calculate the occurrence count of each word in an optimal way?


Possible answer 1:

One solution is to iterate over the words from the beginning of the document, adding each word to a Trie. A word seen for the first time is inserted with its counter set to 1. When we try to add a word that is already in the Trie, we increment its counter instead: $->counter = $->counter + 1.
Note: $ is the terminating character that closes each word in the Trie. To the $ node we add a field called counter that counts the occurrences of that word.
As voora suggested, reading about Tries might help you understand the solution.
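
The Trie idea above could look roughly like the following Python sketch. The names TrieNode, count_words and get_count are made up for illustration, and instead of a separate $ child node the counter is kept directly on the node where a word ends, which is an assumption that plays the same role as $->counter.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child node
        self.counter = 0    # stands in for $->counter: words ending at this node


def count_words(words):
    """Insert every word into a trie, incrementing the counter at its last node."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                # First time this prefix is seen: create the node.
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.counter += 1  # new word goes 0 -> 1, repeated word is incremented
    return root


def get_count(root, word):
    """Return how many times word was inserted (0 if it never was)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.counter
```

For example, after root = count_words("the cat and the hat".split()), calling get_count(root, "the") walks three nodes and reads the counter there. Each insert or lookup costs time proportional to the length of the word, and words sharing a prefix share nodes.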


Possible answer 2:

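Write a MapReduce job for word count.
Mapper: tokenize a line and emit (word, 1) for each word.
Reducer: add up all the values for a key and save the sum corresponding to that word.
For performance, use a combiner, which can be the same as the reducer because addition gives the same total whether the counts are summed all at once or in partial groups.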
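
To make the mapper, combiner and reducer roles concrete, here is a small single-process Python simulation. It is not a real Hadoop job: the function names (mapper, reducer, run_word_count) and the idea of passing input splits as lists of lines are assumptions made for this sketch, and the shuffle step is imitated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Tokenize a line on whitespace and emit (word, 1) for each word."""
    for word in line.split():
        yield (word, 1)


def reducer(word, values):
    """Add all values for the key and return (word, sum)."""
    return (word, sum(values))


def run_word_count(splits):
    """Simulate the job: splits is a list of input splits, each a list of lines."""
    shuffled = defaultdict(list)  # imitates the shuffle: word -> partial sums
    for split in splits:
        # Map phase for this split.
        local = defaultdict(list)
        for line in split:
            for word, one in mapper(line):
                local[word].append(one)
        # Combiner: the reducer logic applied locally, so only one
        # partial sum per word leaves each split.
        for word, values in local.items():
            key, partial = reducer(word, values)
            shuffled[key].append(partial)
    # Reduce phase: sum the partial sums for each word.
    return dict(reducer(word, values) for word, values in shuffled.items())
```

Calling run_word_count([["a b a"], ["b a"]]) sends one partial count per word from each split to the reduce phase instead of one pair per occurrence, which is the saving the combiner is meant to provide.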
