Natural Language Processing:
Natural Language Processing (NLP) is the application of models to text and language. Its focus is teaching machines to understand what is said in spoken and written language. Important terms in Natural Language Processing are as follows:
1. Stemming:
Stemming is the process of removing affixes from a word to obtain its stem. E.g. “confusion”, “confuse” and “confusing” all reduce to one common stem (the widely used Porter stemmer gives “confus”). The purpose is to map related word forms to a single base form so that later steps treat them as the same term. The stem itself need not be a dictionary word.
E.g. Attraction -> Attract
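The idea can be sketched with a toy suffix stripper. This is a simplified illustration under stated assumptions, not the Porter algorithm: the suffix list and the name `simple_stem` are hypothetical and chosen only to cover the examples above.

```python
# Toy suffix-stripping stemmer (illustrative assumption, not the Porter algorithm).
# The suffix list is hypothetical and only covers the examples in the text.
SUFFIXES = ("ion", "ing", "ed", "e", "s")


def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["confusion", "confuse", "confusing", "attraction"]
stems = [simple_stem(w) for w in words]
print(stems)  # ['confus', 'confus', 'confus', 'attract']
```

A real stemmer applies many more rules in several passes; the sketch only shows why different surface forms end up sharing one stem.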
2. Documents:
A document is a single unit of text to be analyzed. Here a document is basically a sentence, e.g. “Rahul thought of completing the assignment today in order to avoid any last-minute delay”.
3. Tokenization:
Tokenization is an early step in Natural Language Processing. It splits a large amount of text into sentences, and the sentences are then split into words. The resulting tokens are individual words, e.g. “world”, “snake”, “fear”.
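A minimal sketch of the two stages, using only regular expressions. The helper names `split_sentences` and `split_words` are hypothetical, and the rules are deliberately naive (abbreviations such as "e.g." would be split incorrectly), so treat this as an illustration rather than a production tokenizer.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?' (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence: str) -> List[str]:
    """Return runs of letters, digits and apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


text = "The world is big. A snake can cause fear!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(sentences)  # ['The world is big.', 'A snake can cause fear!']
print(tokens)     # [['The', 'world', 'is', 'big'], ['A', 'snake', 'can', 'cause', 'fear']]
```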
4. Stop Words:
Stop words are words that are removed or filtered out before further processing of text, because they carry little or no meaning of their own and contribute little to the analysis. They are the most common words in a language, e.g. “a”, “the”, “and”, “to”. They are removed while analyzing text.
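Filtering stop words can be sketched as a set lookup. The stop list below is a tiny illustrative assumption; real stop lists are much longer and depend on the language and the task.

```python
# Tiny illustrative stop list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "the", "and", "to", "of", "in"}


def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


sentence = "Rahul thought of completing the assignment today in order to avoid any delay"
filtered = remove_stop_words(sentence.split())
print(filtered)
# ['Rahul', 'thought', 'completing', 'assignment', 'today', 'order', 'avoid', 'any', 'delay']
```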
5. Corpus:
A corpus is a collection of texts or documents, e.g. a corpus may contain 10 documents, i.e. 10 text files.
6. Sparse Terms:
Sparse terms are terms that occur in only a few documents (sentences). They are not repeated across the corpus.
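One way to find sparse terms is to count how many documents contain each term and keep those below a cutoff. The sketch below assumes whitespace tokenization, and the `max_fraction` threshold is a hypothetical parameter, not a standard value.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # set(): count each term once per document
    return df


def sparse_terms(docs, max_fraction=0.25):
    """Terms found in at most max_fraction of the documents (hypothetical cutoff)."""
    df = document_frequency(docs)
    limit = max_fraction * len(docs)
    return sorted(term for term, n in df.items() if n <= limit)


docs = ["the snake is long", "the world is big", "the fear is real", "the snake bites"]
print(sparse_terms(docs))  # ['big', 'bites', 'fear', 'long', 'real', 'world']
```

With four documents and a cutoff of 0.25, only terms that occur in a single document count as sparse, so “snake” (two documents) and “the” (all four) are left out.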
7. Document Term Matrix:
A document term matrix is a matrix with documents as rows and terms as columns. Each cell typically holds the number of times that term occurs in that document.
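A small sketch of how such a matrix can be built from raw counts. It assumes whitespace tokenization and an alphabetically sorted vocabulary; the function name `document_term_matrix` is hypothetical.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])  # missing terms count as 0
    return vocab, matrix


docs = ["snake fear snake", "world fear"]
vocab, matrix = document_term_matrix(docs)
print(vocab)   # ['fear', 'snake', 'world']
print(matrix)  # [[1, 2, 0], [1, 0, 1]]
```

Most cells are zero once the vocabulary grows, which is why such matrices are usually stored in a sparse format in practice.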
8. Parts-of-Speech (POS) Tagging:
Parts-of-speech tagging is the process of assigning a grammatical category to each token of a sentence. It tags every word in the document with its part of speech, such as noun, verb or adjective.
9. Normalization:
Normalization of text is another important part of natural language processing. Text is normalized before further processing so that all of it is treated on a level playing field.
Normalization includes converting all the text to the same case (upper case or lower case), removing punctuation, etc.
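The two steps named above, case folding and punctuation removal, can be sketched with the standard library. The function name `normalize` is hypothetical, and `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text, drop ASCII punctuation and collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


print(normalize("Hello, World!  It's NLP."))  # hello world its nlp
```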
10. Bag of Words:
Bag of words ignores grammar and word order. Every document or sentence is treated as a bag of its words, keeping only which words occur and how often. E.g. “travel places” and “places travel” have the same representation and therefore receive the same score.
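A minimal sketch of the representation, assuming whitespace tokenization and lowercase folding; `bag_of_words` is a hypothetical helper built on `collections.Counter`.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word to its count, discarding word order."""
    return Counter(text.lower().split())


first = bag_of_words("travel places")
second = bag_of_words("places travel")
print(first == second)  # True: the two phrases are indistinguishable as bags
```

Because only counts are kept, any model built on this representation cannot tell the two phrases apart.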