
Trigrams, Bigrams and Ngrams in Python for Text Analysis


Creating trigrams in Python is very simple: zip a sequence with copies of itself shifted by one and two positions.

# list() is needed on Python 3, where zip returns a lazy iterator
trigrams = lambda a: list(zip(a, a[1:], a[2:]))
trigrams(('a', 'b', 'c', 'd', 'e', 'f'))
# => [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e'), ('d', 'e', 'f')]

You can generalize this to n-grams of any size:

# Build n shifted copies of the sequence and zip them together
ngrams = lambda a, n: list(zip(*[a[i:] for i in range(n)]))
bigrams = ngrams(('a', 'b', 'c', 'd', 'e', 'f'), 2)
# => [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]
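
To see why the one-liner works, it can help to spell out the intermediate step. The following is only a sketch; the name ngrams_verbose is made up for this illustration and is not part of any library.

```python
def ngrams_verbose(tokens, n):
    """Return the n-grams of a sequence as a list of tuples (illustrative helper)."""
    # n copies of the sequence, each starting one position later than the last
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so incomplete n-grams at the end are dropped
    return list(zip(*shifted))

letters = ('a', 'b', 'c', 'd')
shifted = [letters[i:] for i in range(3)]
# shifted == [('a', 'b', 'c', 'd'), ('b', 'c', 'd'), ('c', 'd')]
result = ngrams_verbose(letters, 3)
# result == [('a', 'b', 'c'), ('b', 'c', 'd')]
```

Because zip stops at the shortest input, a sequence shorter than n simply yields an empty list.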

When analyzing text, it's useful to see how often terms are used together.

txt = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis lorem ipsum aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint lorem ipsum occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'.lower().split()
print(list(ngrams(txt, 2)))
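
Note that splitting on whitespace leaves punctuation attached to the words, so tokens like 'amet,' and 'consequat.' end up in the bigrams. One possible way around this, sketched here with a hypothetical tokenize helper and the assumption that only ASCII letters matter, is to extract the words with a regular expression instead:

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep only runs of
    # ASCII letters, which drops commas, periods and other punctuation.
    return re.findall(r"[a-z]+", text.lower())

sample = "Lorem ipsum dolor sit amet, consectetur. Duis lorem ipsum."
tokens = tokenize(sample)
# Same zip trick as above, applied to the cleaned tokens
bigrams = list(zip(tokens, tokens[1:]))
```

This still pairs words across sentence boundaries, for example ('consectetur', 'duis'). If that matters for your analysis, split the text into sentences first and build the n-grams per sentence.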

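You can use a Counter from the collections module to see the most common bigrams. Only ('lorem', 'ipsum') occurs more than once here, so the order of the remaining entries, which are all tied at a count of 1, may differ depending on your Python version.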

from collections import Counter
Counter(ngrams(txt, 2)).most_common(5)
[(('lorem', 'ipsum'), 3),
 (('consequat.', 'duis'), 1),
 (('in', 'voluptate'), 1),
 (('consectetur', 'adipisicing'), 1),
 (('ipsum', 'dolor'), 1)]
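
The same counting works for any n. As a rough sketch, the two steps can be wrapped in a small function; top_ngrams and the sample sentence are made up for this example.

```python
from collections import Counter

def top_ngrams(tokens, n, k):
    # Hypothetical helper: build the n-grams with the zip trick,
    # then return the k most frequent ones with their counts.
    grams = zip(*[tokens[i:] for i in range(n)])
    return Counter(grams).most_common(k)

words = 'the cat sat on the mat and the cat sat down'.split()
top_trigram = top_ngrams(words, 3, 1)
# top_trigram == [(('the', 'cat', 'sat'), 2)]
top_bigrams = top_ngrams(words, 2, 2)
# both ('the', 'cat') and ('cat', 'sat') occur twice
```

Counter accepts any iterable, so the zip result does not need to be turned into a list first.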