I have a text file with hundreds of lines.
There are many examples of finding the most repeated words, and some of finding the most repeated lines or long phrases.
But I would like to extract all the 3-word phrases and then filter them to keep only the most frequent ones.
It would be easy to get the most frequent phrases with just "| sort | uniq -c | sort -n"...
But I need to obtain all the possible phrases of 3 words in a file.
How could I do that?
Thanks
I tried this script from somewhere:
#!/usr/bin/perl -w
use strict;
use Set::Tiny;
# a partial list of common articles, prepositions and small words joined into
# a regex.
my $sw = join("|", qw(
a about after against all among an and around as at be before between both
but by can do down during first for from go have he her him how
I if in into is it its last like me my new of off old
on or out over she so such that the their there they this through to
too under up we what when where with without you your)
);
my %sets=(); # word sets for each title.
my %titles=(); # count of how many times we see the same title.
while(<>) {
chomp;
# take a copy of the original input line, so we can use it as
# a key for the hashes later.
my $orig = $_;
# "simplify" the input line
s/[[:punct:]]//g; #/ strip punctuation characters
s/^\s*|\s*$//g; #/ strip leading and trailing spaces
$_=lc; #/ lowercase everything, case is not important.
s/\b($sw)\b//iog; #/ optional. strip small words
next if (/^$/);
$sets{$orig} = Set::Tiny->new(split);
$titles{$orig}++;
};
my @keys = (sort keys %sets);
foreach my $title (@keys) {
next unless ($titles{$title} > 0);
# if we have any exact dupes, print them. and make sure they won't
# be printed again.
if ($titles{$title} > 1) {
print "$title\n" x $titles{$title};
$titles{$title} = 0;
};
foreach my $key (@keys) {
next unless ($titles{$key} > 0);
next if ($key eq $title);
my $intersect = $sets{$key}->intersection($sets{$title});
my $k=scalar keys %{ $intersect };
#print STDERR "====>$k(" . join(",",sort keys %{ $intersect }) . "):$title:$key\n" if ($k > 1);
if ($k >= 3) {
print "$title\n" if ($titles{$title} > 0);
print "$key\n" x $titles{$key};
$titles{$key} = 0;
$titles{$title} = 0;
};
};
};
But it does not give only 3-word combinations: it groups whole lines that share at least 3 words.
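For reference, what I am after could probably be sketched as a sliding window of 3 consecutive words fed into the sort/uniq pipeline mentioned above. This is only a rough sketch under my own assumptions: phrases do not cross line breaks, punctuation is dropped, and case is ignored. The function names `three_word_phrases` and `top_phrases` are made up for illustration.

```shell
#!/bin/sh
# Print every run of 3 consecutive words found on each input line.
# Assumptions: phrases never span a line break, punctuation is removed,
# and everything is folded to lowercase before splitting into words.
three_word_phrases() {
    tr '[:upper:]' '[:lower:]' |
    tr -d '[:punct:]' |
    awk '{ for (i = 1; i + 2 <= NF; i++) print $i, $(i+1), $(i+2) }'
}

# Count identical phrases and list the most frequent ones first.
top_phrases() {
    three_word_phrases | sort | uniq -c | sort -rn
}
```

Something like `top_phrases < file.txt | head -n 20` would then show the 20 most frequent 3-word phrases, where `file.txt` stands in for the real input file. Lines with fewer than 3 words produce nothing.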