With the rise of Alexa, Google Home, Siri, etc., one thing is certain: NLP-powered assistants are going to become an important technology for developers. I will explain what exactly NLP (Natural Language Processing) is, its technology, and its applications. by Aswathi Nambiar, 5 min read.
I am sure we are all mesmerized by Amazon’s Alexa and eagerly waiting to own one; at least I am. Out of sheer curiosity, I decided to check out what really goes into the making of this wonderful voice-activated system and stumbled upon “Natural Language Processing Systems”.
So what exactly is Natural Language Processing (NLP)?
As humans, we may speak Hindi, English, Chinese, etc., but the computer’s language is not one of words but of lots of ones and zeros. NLP is a way to interact with computers in a smart and efficient way. In layman’s terms, NLP helps computers understand and interpret human language.
Applications of NLP
NLP is everywhere, even if we don’t realize it. Does your Gmail warn you when you try to send a mail without the attachment you referenced in its text? That is NLP working for you. Machine translation, speech recognition, automatic summarization, sentiment analysis, text mining, etc. are some of the applications of NLP.
Libraries of NLP
1) Natural Language Toolkit (NLTK)
2) Apache OpenNLP
3) Stanford NLP suite
4) GATE NLP Library
The Natural Language Toolkit (NLTK) is the most popular library for natural language processing. It is written in Python and has a big community behind it. NLTK is also very easy to learn; in fact, it’s the easiest NLP library that you’ll use.
Now let’s understand a few terms in NLP
Corpus: A corpus is a large collection of texts. It may be composed of written language, spoken language, or both; a spoken corpus is usually in the form of audio recordings. The plural of corpus is corpora.
Tokenization: Tokenization is the task of chopping a sentence or a string into pieces, called tokens.
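As a minimal sketch of the idea (NLTK provides a proper tokenizer, `word_tokenize`; the regex version below is just an illustration):

```python
import re

def tokenize(text):
    # Keep runs of word characters as tokens; punctuation marks
    # become separate tokens of their own.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello Aswathi, how are you doing today?"))
# ['Hello', 'Aswathi', ',', 'how', 'are', 'you', 'doing', 'today', '?']
```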
Stopwords: Sometimes, some extremely common words that appear to be of little value in understanding the data need to be excluded from the vocabulary entirely. These words are called stop words.
Before, our sentence was: “Hello Aswathi How are you doing today”
After removing the stopwords, our sentence is: “Hello Aswathi How today”. We got rid of “are”, “you”, and “doing”, which do not add value to the general tone of the sentence.
Alternative Approach: Instead of importing the stopwords module, you can also build your own list of words that you consider less valuable for your analysis.
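A small sketch of this approach, reproducing the example sentence above (the stopword set here is purely illustrative):

```python
# A hand-picked stopword set instead of NLTK's stopwords module.
my_stopwords = {"are", "you", "doing"}

sentence = "Hello Aswathi How are you doing today"
# Keep only the words that are not in our custom stopword set.
filtered = [w for w in sentence.split() if w.lower() not in my_stopwords]
print(" ".join(filtered))  # Hello Aswathi How today
```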
Normalization: Predict, prediction, predicting, and predictable are all variations of the same word, “predict”. Though they differ contextually, all are similar. Normalization converts these variants of a word into its normal form. There are two important types of text normalization.
1) Stemming: Stemming refers to a crude heuristic process that chops off the ends of words in the hope of achieving the normalized word.
2) Lemmatization: Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
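In practice you would use NLTK’s `PorterStemmer` (and its `WordNetLemmatizer` for lemmatization). To make the “crude heuristic” nature of stemming concrete, here is a toy suffix-stripping stemmer (the function and its suffix list are illustrative, not a real algorithm):

```python
def crude_stem(word, suffixes=("ing", "ion", "able", "s")):
    # A deliberately crude stemmer: chop off the first matching
    # suffix, provided a reasonable stem remains.
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

for w in ["predict", "predicting", "prediction", "predictable"]:
    print(w, "->", crude_stem(w))
# every variant reduces to "predict"
```

A real stemmer such as Porter’s applies many such rules in sequence; lemmatization instead looks words up against a vocabulary, which is why it needs the morphological resources mentioned above.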
Bag of Words (BoW): In order to run machine learning algorithms, we have to convert the text into feature vectors. The BoW model helps us do so by splitting each sentence into words and counting the number of occurrences of each word. Each unique word acts as a feature for training our model.
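A minimal sketch of the BoW idea in plain Python (scikit-learn’s `CountVectorizer` does this for real work; the sentences here are made up):

```python
from collections import Counter

sentences = ["the cat sat on the mat", "the dog sat"]

# Build the vocabulary: every unique word across all sentences
# becomes one feature.
vocab = sorted({w for s in sentences for w in s.split()})

def bow_vector(sentence, vocab):
    # Count word occurrences, then lay them out in vocabulary order.
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

print(vocab)                            # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vector(sentences[0], vocab))  # [1, 0, 1, 1, 1, 2]
```

Note that word order is lost, which is why the model is called a “bag” of words.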
Tf-idf stands for Term Frequency-Inverse Document Frequency.
Term Frequency (Tf) identifies how frequently a word occurs in a document. Since every document differs in length, a term would likely appear many more times in long documents than in short ones. Thus the term frequency is divided by the total number of words in that document as a way of normalization.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
But we often come across many words in a document that do not contribute much yet have very high frequencies, such as “the”, “are”, “as”, “is”, etc. One approach is to remove these stop words, which we have already seen above. Another approach is to use the Inverse Document Frequency (Idf), which measures how important a word is. Inverse Document Frequency decreases the weight of commonly used words and increases the weight of words which are rare and thus add more value to the text.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
Tf-Idf(t) = TF(t) * IDF(t)
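The two formulas above can be sketched directly in Python (the three documents are made up for illustration; real projects would use something like scikit-learn’s `TfidfVectorizer`):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs and cats".split(),
]

def tf(term, doc):
    # (occurrences of term in doc) / (total words in doc)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its Idf is log(3/3) = 0
# and its Tf-Idf score vanishes; "cat" appears in only two.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

As expected, the ubiquitous “the” scores 0 while the rarer “cat” gets a positive weight.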