

find n most frequent words in a file

I want to find, say, the 10 most common words in a text file. Firstly, the solution should be optimized for keystrokes (in other words, my time).
Secondly, for performance. Here is what I have so far to get the top 10:

cat test.txt | tr -c '[:alnum:]' '[\n*]' | uniq -c | sort -nr | head -10


6 k
2 g
2 e
2 a
1 r
1 k22
1 k
1 f
1 eeeeeeeeeeeeeeeeeeeee
1 d

I could write a Java or Python program where I store (word, numberOfOccurrences) in a dictionary and sort by the value, or I could use
MapReduce, but I'm optimizing for keystrokes.

Are there any false positives? Is there a better way?

command-line / shell-script

asked Jun 24 '12 at 0:07
Lukasz Madon

5 Answers

That's pretty much the most common way of finding "N most common things", except you're
missing a sort, and you've got a gratuitous cat:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10

If you don't put in a sort before the uniq -c you'll probably get a lot of false singleton words.
uniq only collapses consecutive runs of identical lines, not overall uniqueness.
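
To see the difference with a tiny, made-up input (just an illustration):

$ printf 'dog\ncat\ndog\n' | uniq -c
      1 dog
      1 cat
      1 dog

$ printf 'dog\ncat\ndog\n' | sort | uniq -c
      1 cat
      2 dog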

EDIT: I forgot a trick, "stop words". If you're looking at English text (sorry, monolingual North
American here), words like "of", "and", "the" almost always take the top two or three places.
You probably want to eliminate them. The GNU Groff distribution has a file named eign in it
which contains a pretty decent list of stop words. My Arch distro has
/usr/share/groff/current/eign , but I think I've also seen /usr/share/dict/eign or
/usr/dict/eign in old Unixes.

You can use stop words like this:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head -10

My guess is that most human languages need similar "stop words" removed from meaningful
word frequency counts, but I don't know where to suggest getting other languages' stop-word
lists.

EDIT: fgrep should use the -w option, which enables whole-word matching. This avoids
false positives on words that merely contain short stop words, like "a" or "i".
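
As an illustration (stop.txt here is a made-up file containing only the single word "a"): without -w every word containing that letter is filtered out, while with -w only the bare word "a" is removed:

$ printf 'a\nand\nbanana\n' | fgrep -v -f stop.txt
$ printf 'a\nand\nbanana\n' | fgrep -v -w -f stop.txt
and
banana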

edited Feb 10 '16 at 6:33 by polemon
answered Jun 24 '12 at 0:35 by Bruce Ediger
1 Does cat add some significant performance overhead? I like the pipe syntax. What does the * in '[\n*]' do?
Lukasz Madon Jun 24 '12 at 1:04

If you like the "cat test.txt", then by all means use it. I've read an article someplace where Dennis Ritchie says that
the "cat something | somethingelse" syntax is more widely used, and that the '< something' syntax was something of
a mistake, since it's single purpose. Bruce Ediger Jun 24 '12 at 22:25

What if I want to find the most common directory name in a find output? That is, split words on / instead of
whitespace characters and similar. erb Feb 15 '16 at 13:03

1 @erb - you would probably do something like: find somewhere options | tr '/' '\n' | sort | uniq -c |
sort -k1.1nr | head -10 Bruce Ediger Feb 15 '16 at 15:03

1 @erb - ask that as a question, not in a comment. You will have more room to frame your question, so as to get the
answer you need. Give example input, and desired output. You might get some reputation points for asking a good
question, and I will get points for giving a better answer than I can in a comment. Bruce Ediger Feb 15 '16 at 17:39

This works better with UTF-8:

$ sed -e 's/\s/\n/g' < test.txt | sort | uniq -c | sort -nr | head -10

answered Aug 28 '13 at 21:45
Vladislav Schogol

Let's use AWK!

This function lists the frequency of each word occurring in the provided file in
descending order:

function wordfrequency() {
  awk '
    BEGIN { FS="[^a-zA-Z]+" }
    {
      for (i=1; i<=NF; i++) {
        word = tolower($i)
        words[word]++
      }
    }
    END {
      for (w in words)
        printf("%3d %s\n", words[w], w)
    }' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

and for the top 10 words:

$ cat your_file.txt | wordfrequency | head -10
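
Since the function reads standard input, the cat can also be dropped (the same gratuitous-cat point from the first answer):

$ wordfrequency < your_file.txt | head -10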

Source: AWK-ward Ruby

edited Jan 17 '15 at 16:03 answered Dec 15 '14 at 22:57
Sheharyar

Let's use Haskell!

This is turning into a language war, isn't it?

import Data.List
import Data.Ord

main = interact $ (=<<) (\x -> show (length x) ++ " - " ++ head x ++ "\n")
. sortBy (flip $ comparing length)
. group . sort
. words

Usage:

cat input | wordfreq

Alternatively:

cat input | wordfreq | head -10
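
This assumes the program has been compiled to an executable named wordfreq; with GHC that could be done with something like this (taking wordfreq.hs as a hypothetical name for the source file above):

ghc -O2 -o wordfreq wordfreq.hs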


answered Oct 3 '15 at 3:33
BlackCap

Something like this should work using Python, which is commonly available:

cat slowest-names.log | python -c 'import collections, sys; print collections.Counter(sys.stdin);'

This assumes one word per line. If there are more, splitting should be easy as well.
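
For example, splitting the whole input on whitespace before counting could look like this (still Python 2, to match the print statement above):

cat slowest-names.log | python -c 'import collections, sys; print collections.Counter(sys.stdin.read().split()).most_common(10)'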

answered May 23 '16 at 7:26
Reut Sharabani
