Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

mongodb - Mongo: count the number of word occurrences in a set of documents

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

I'd like to count the number of occurrences of each word (case insensitive), then sort in descending order. The result should be something like:

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any idea how to do this? Aggregation framework would be preferred, as I understand it to some degree already :)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

MapReduce might be a good fit that can process the documents on the server without doing manipulation on the client (as there isn't a feature to split a string on the DB server (open issue).

Start with the map function. In the example below (which likely needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and if it's there, lowercases it, splits on a space, and then emits a 1 for each word found.

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};

Then, in the reduce function, it sums all of the results found by the map function and returns a discrete total for each word that was emitted above.

var reduce = function( key, values ) {    
    var count = 0;    
    values.forEach(function(v) {            
        count +=v;    
    });
    return count;
}

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

The results with your sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "or", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...