Trigrams, Bigrams and Ngrams in Python for Text Analysis

Written by Sean Behan on Mon Mar 06th 2017

Creating trigrams in Python takes a single line: zip a sequence with copies of itself shifted by one and two positions.

# list() is needed on Python 3, where zip returns a lazy iterator
trigrams = lambda a: list(zip(a, a[1:], a[2:]))
trigrams(('a', 'b', 'c', 'd', 'e', 'f'))
# => [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f')]

You can generalize this to n-grams of any length by zipping n shifted copies of the sequence.

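# a[i:] for i in 0..n-1 gives the n shifted copies; list() makes the result reusable on Python 3
ngrams = lambda a, n: list(zip(*[a[i:] for i in range(n)]))
bigrams = ngrams(('a', 'b', 'c', 'd', 'e', 'f'), 2)
# => [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]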
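
The same idea can also be written as a named function, which is easier to document and accepts any iterable, not just sequences that support slicing. This is only a minimal sketch; the name ngrams_list is made up for illustration.

```python
def ngrams_list(seq, n):
    """Return all n-grams of seq as a list of tuples.

    ngrams_list is a hypothetical name used only for this example.
    """
    seq = list(seq)  # accept any iterable, not just sliceable sequences
    # zip stops at the shortest slice, so incomplete n-grams are dropped
    return list(zip(*[seq[i:] for i in range(n)]))


print(ngrams_list("abcdef", 3))
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f')]
```

If the input is shorter than n, the result is simply an empty list.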

When analyzing text, it's useful to see how often terms are used together.

txt = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis lorem ipsum aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint lorem ipsum occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'.lower().split()
print(list(ngrams(txt, 2)))
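
Note that split() leaves punctuation attached to words, so 'amet,' and 'amet' would be counted as different terms. One possible way around this, sketched here with a hypothetical tokenize helper and a deliberately simple regular expression (ASCII letters only), is to extract runs of letters before building the n-grams.

```python
import re


def tokenize(text):
    """Lowercase text and return runs of ASCII letters.

    tokenize is a hypothetical helper for this example; the pattern
    ignores digits, apostrophes and non-ASCII letters.
    """
    return re.findall(r"[a-z]+", text.lower())


sample = "Lorem ipsum dolor sit amet, consectetur adipisicing elit."
tokens = tokenize(sample)
print(tokens)
# ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipisicing', 'elit']

# Bigrams built from the cleaned tokens
print(list(zip(tokens, tokens[1:]))[:3])
# [('lorem', 'ipsum'), ('ipsum', 'dolor'), ('dolor', 'sit')]
```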

You can use a Counter from the collections module to find the most common n-grams.

from collections import Counter
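Counter(ngrams(txt, 2)).most_common(5)
# => [(('lorem', 'ipsum'), 3),
#     (('consequat.', 'duis'), 1),
#     (('in', 'voluptate'), 1),
#     (('consectetur', 'adipisicing'), 1),
#     (('ipsum', 'dolor'), 1)]
# Which of the count-1 bigrams appear, and in what order, depends on the Python version.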
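
The two steps can be wrapped into a single helper that works for any n. This is a sketch under the assumption that the input is already a list of tokens; top_ngrams is a made-up name, not a library function.

```python
from collections import Counter


def top_ngrams(tokens, n, k):
    """Return the k most common n-grams in tokens with their counts.

    top_ngrams is a hypothetical helper for this example.
    """
    # Zip n shifted copies of the token list to form the n-grams
    grams = zip(*[tokens[i:] for i in range(n)])
    return Counter(grams).most_common(k)


words = "to be or not to be that is the question to be".split()
print(top_ngrams(words, 2, 1))
# [(('to', 'be'), 3)]
```

Counter accepts the zip result directly because it only needs to iterate over the n-grams once.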

Tagged with..
#Python #Text Analysis #Ngrams #Trigrams #Bigrams #Functional Programming
