BIKASH THAPA

Processing Text Data with NLTK

May 28th, 2019

Processing Text for Machine Learning

If you have worked on Natural Language Processing (NLP), you have surely reached a point where you had to turn a large corpus of raw text into data with a certain structure and order.

In this post we'll learn how to use NLTK (the Natural Language Toolkit) to process text data.


1. sent_tokenize()

This function takes a string containing one or more sentences and returns each sentence as an individual token:

import nltk

from nltk.tokenize import sent_tokenize

# the 'punkt' sentence tokenizer models may need to be downloaded first:
# nltk.download('punkt')

raw_text = 'nltk\'s isalpha() removes the numeric tokens like 123. It also filters special characters such as $$, #, ++ etc'

sent = sent_tokenize(raw_text)

print(sent)

Output:

["nltk's isalpha() removes the numeric tokens like 123.", 'It also filters special characters such as $$, #, ++ etc']

2. word_tokenize()

This function takes a string as its argument and returns the individual words in that string:

import nltk

from nltk.tokenize import word_tokenize

raw_text = 'nltk is very useful in nlp'

words = word_tokenize(raw_text)

print(words)

Output:

['nltk', 'is', 'very', 'useful', 'in', 'nlp']

3. isalpha()

isalpha() is actually a built-in Python string method rather than an NLTK function: it returns True only when every character in a string is alphabetic, so it can be used to filter out non-alphabetical tokens such as numbers and special characters. Applied directly to a raw string, however, it operates character by character:

raw_text = """nltk's isalpha() removes the numeric tokens like 123 or special characters such as $$, #, ++ etc."""

# iterating over a raw string yields individual characters, not words
words = [w for w in raw_text if w.isalpha()]

print(words)

Output:

['n', 'l', 't', 'k', 's', 'i', 's', 'a', 'l', 'p', 'h', 'a', 'r', 'e', 'm', 'o', 'v', 'e', 's', 't', 'h', 'e', 'n', 'u', 'm', 'e', 'r', 'i', 'c', 't', 'o', 'k', 'e', 'n', 's', 'l', 'i', 'k', 'e', 'o', 'r', 's', 'p', 'e', 'c', 'i', 'a', 'l', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', 's', 'u', 'c', 'h', 'a', 's', 'e', 't', 'c']

However, you usually want to filter whole words rather than individual characters. To do this, apply word_tokenize() before isalpha():

import nltk

from nltk.tokenize import word_tokenize

raw_text = """nltk's isalpha() removes the numeric tokens like 123 or special characters such as $$, #, ++ etc."""

words = word_tokenize(raw_text)

words = [w for w in words if w.isalpha()]

print(words)

Output:

['nltk', 'isalpha', 'removes', 'the', 'numeric', 'tokens', 'like', 'or', 'special', 'characters', 'such', 'as', 'etc']

However, sometimes we do not want certain special tokens removed from our data. The following code keeps the words 'C++', '#', and '.NET':

import nltk

from nltk.tokenize import word_tokenize

raw_text = 'nltk\'s isalpha() removes the non-alphabetical characters but we can customize the code to make it not filter special characters such as C++, $$, #, .NET etc'

words = word_tokenize(raw_text)

words = [w for w in words if w in {'C++', '$$', '#', '.NET'} or w.isalpha()]

print(words)

Output:

['nltk', 'isalpha', 'removes', 'the', 'characters', 'but', 'we', 'can', 'customize', 'the', 'code', 'to', 'make', 'it', 'not', 'filter', 'special', 'characters', 'such', 'as', 'C++', '#', '.NET', 'etc']

Note that '$$' still disappears: word_tokenize() splits it into two separate '$' tokens, neither of which matches the set.

4. stopwords

In machine learning, it is important to filter out the most common words. The following are the default common words for the English language that ship with the NLTK library; these are called stop words:

- - - -

i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't

- - - -

However, you can add any English words to this list based on the dataset you're working with. We filter the stop words in our code as follows:

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

# the stop word lists may need to be downloaded first:
# nltk.download('stopwords')

raw_text = 'nltk\'s isalpha() removes the non-alphabetical characters but we can customize the code to make it not filter special characters such as C++, $$, #, .NET etc'

stop_words = set(stopwords.words('english'))

words = word_tokenize(raw_text)

words = [w for w in words if w in {'C++', '$$', '#', '.NET'} or w.isalpha()]

words = [w for w in words if w not in stop_words]

print(words)

Output:

['nltk', 'isalpha', 'removes', 'characters', 'customize', 'code', 'make', 'filter', 'special', 'characters', 'C++', '#', '.NET', 'etc']

As shown in the output, the stop words have been removed.

There is more you can do with NLTK than this, which I will cover later.