There is a large document that contains millions of words.
Microsoft Interview Questions and Answers
Question:
There is a large document that contains millions of words. How would you count the occurrences of each word in an optimal way?
Possible answer 1:
The solution is to iterate over the words from the beginning of the document, adding each word to a trie. When we try to add a word that is already in the trie, we increment that word's counter: $->counter = $->counter + 1.
Note: $ is the terminating character of each word in the trie. To the $ node we add a field called counter, which counts the occurrences of that word.
As voora suggested, reading about tries might help you understand the solution.
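The trie approach above can be sketched roughly as follows. This is only an illustration under a few assumptions that are not in the original answer: the names TrieNode, add_word, count_words, and iter_counts are made up for this sketch, words are taken to be lowercase runs of letters and apostrophes, and the $ terminator is modeled by keeping the counter on the node where a word ends instead of on a separate $ child.

```python
import re


class TrieNode:
    """One trie node; counter > 0 means a word ends here."""
    __slots__ = ("children", "counter")

    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.counter = 0    # stands in for $->counter from the answer


def add_word(root, word):
    """Walk or create the path for word, then bump its counter."""
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = node.children[ch] = TrieNode()
        node = child
    node.counter += 1


def count_words(lines):
    """Build a trie from an iterable of text lines (for example, an open file)."""
    root = TrieNode()
    for line in lines:
        # Assumed tokenization: lowercase letters and apostrophes only.
        for word in re.findall(r"[a-z']+", line.lower()):
            add_word(root, word)
    return root


def iter_counts(node, prefix=""):
    """Yield (word, count) pairs in alphabetical order."""
    if node.counter:
        yield prefix, node.counter
    for ch, child in sorted(node.children.items()):
        yield from iter_counts(child, prefix + ch)
```

Because count_words reads one line at a time, the document never has to be held in memory as a whole; only the trie itself grows, and it grows with the number of distinct words, not with the total word count.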
Possible answer 2:
Write a MapReduce job for the word count.
Mapper: tokenize a line and emit (word, 1) for each word.
Reducer: add up all the values for a key and save the sum as the count for that word.
For performance, use a combiner, which can be the same function as the reducer because addition is associative and commutative.
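To make the mapper, combiner, and reducer roles concrete, here is a small single-process simulation. It is not a real MapReduce job and does not use any framework API; the function names (mapper, reducer, shuffle, run_split, word_count), the tokenization rule, and the idea of passing the document as a list of "splits" are all assumptions made for this sketch.

```python
import re
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Tokenize a line and emit (word, 1) for each word."""
    # Assumed tokenization: lowercase letters and apostrophes only.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def reducer(word, values):
    """Add all values for the key; also used as the combiner."""
    return word, sum(values)


def shuffle(pairs):
    """Group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_split(lines):
    """Map one split, then combine locally to shrink the output."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return [reducer(word, values) for word, values in shuffle(mapped).items()]


def word_count(splits):
    """Run every split, then reduce the combined partial sums."""
    combined = chain.from_iterable(run_split(split) for split in splits)
    return dict(reducer(word, values) for word, values in shuffle(combined).items())
```

For example, word_count([["a b a"], ["b c", "a"]]) first combines the splits into the partial sums (a, 2), (b, 1) and (b, 1), (c, 1), (a, 1), and the final reduce step adds these per word. The combiner only reduces the amount of data passed to the reducer; the final counts are the same with or without it.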