Basic Introduction to Natural Language Processing (NLP)

Nagvekar
4 min read · Feb 19, 2021

Communicating effectively is the most important of all life skills. It enables us to transfer information to other people and to understand what is said to us. Some examples include communicating new ideas, expressing your feelings vocally (e.g., speech), in writing (e.g., printed articles, emails, websites), visually (e.g., logos, charts, or graphs), or non-verbally (e.g., body language or gestures).

A computer, on the other hand, is designed to understand its native language, what we call machine language or machine code, which is incomprehensible to most people. Communication between humans and computers happens not in words but in millions of zeros and ones, which we call binary numbers.

A few decades back, programmers used punch cards to communicate with computers; today, even laypeople use devices like Alexa and Siri to communicate and get their work done.

Let's take a closer look at how these devices are now able to communicate with humans and how this has been made possible.

Natural Language Processing (NLP)

Natural Language Processing, or NLP, is a subfield of linguistics, computer science, information technology, and artificial intelligence that gives machines the ability to read, understand, speak, and derive meaning from human languages.

A few examples where we use NLP daily:

Autocomplete

Spell Check

Voice-to-text messaging

Spam Filters

Related keywords on search engines

Siri, Alexa, or Google Assistant

Text Preprocessing

It's the process of bringing your text into a form that is predictable and analyzable, so that it can be used to build models.

Steps of preprocessing:

Tokenization

Stopword Removal

N-Grams

Stemming

Word Sense Disambiguation

Count Vectorizer

TF-IDF (TFIDF Vectorizer)

Hashing Vectorizer

TOKENIZATION: The task of breaking a text into pieces; each piece is called a token.

There are two types of tokenization:

Word Tokenization: A sentence is broken into multiple words.

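A word tokenizer can be sketched with a simple regular expression (a toy stand-in for library tokenizers such as NLTK's word_tokenize, which handle many more cases):

```python
import re

def word_tokenize(text):
    # Grab runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

print(word_tokenize("NLP makes machines understand text."))
# ['NLP', 'makes', 'machines', 'understand', 'text']
```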

Sentence Tokenization: The process of splitting a text into multiple sentences.

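A minimal sentence tokenizer can split on whitespace that follows sentence-ending punctuation (a rough sketch; real tokenizers handle abbreviations, quotes, etc.):

```python
import re

def sent_tokenize(text):
    # Split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(sent_tokenize("NLP is fun. It powers Siri! Do you use it?"))
# ['NLP is fun.', 'It powers Siri!', 'Do you use it?']
```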

Stopword Removal: Stopwords are words that do not add much meaning to a sentence. They can be safely ignored without sacrificing the meaning of the sentence.

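Stopword removal is just a filter over the token list. The stopword set below is a tiny illustrative one; libraries like NLTK ship much larger lists:

```python
# Tiny illustrative stopword set (real lists have ~100+ entries).
stopwords = {"is", "a", "the", "of", "to", "and"}

tokens = ["nlp", "is", "a", "subfield", "of", "ai"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['nlp', 'subfield', 'ai']
```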

N-Grams: An n-gram is a contiguous sequence of n items from a given sample of text or speech. N-grams are extensively used in text mining and natural language processing tasks.

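Generating n-grams from a token list takes one line with a sliding window; for example, bigrams (n = 2):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["natural", "language", "processing", "tasks"], 2))
# [('natural', 'language'), ('language', 'processing'), ('processing', 'tasks')]
```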

Stemming: The process of reducing an inflected word to its word stem by stripping affixes (suffixes and prefixes).

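The idea can be sketched as simple suffix stripping (a toy version; real stemmers such as NLTK's PorterStemmer apply many more rules):

```python
def simple_stem(word):
    # Strip a common suffix, keeping a stem of at least 3 characters.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["playing", "played", "plays", "play"]])
# ['play', 'play', 'play', 'play']
```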

Word Sense Disambiguation: Identifying which sense of a word (i.e., which meaning) is used in a sentence, when the word has multiple meanings.

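A classic approach is the Lesk algorithm: pick the sense whose dictionary definition overlaps most with the surrounding context. A toy sketch (the sense names and definitions below are made up for illustration):

```python
def simple_lesk(context_words, senses):
    # senses: {sense_name: definition string}. Pick the sense whose
    # definition shares the most words with the context.
    context = set(context_words)
    return max(senses, key=lambda s: len(context & set(senses[s].split())))

senses = {
    "financial_bank": "an institution that accepts money deposits",
    "river_bank": "sloping land beside a body of water such as a river",
}
print(simple_lesk(["he", "sat", "by", "the", "river", "water"], senses))
# river_bank
```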

Count Vectorizer: It tokenizes the text, performs very basic preprocessing, and counts the number of times each word appears in each document.

Words missing from the vocabulary are weighted as 0.
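With scikit-learn's CountVectorizer (assuming that is the implementation in question), the three example sentences used later in this post vectorize like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Good Boy", "Good Girl", "Boy Girl Good"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['boy', 'girl', 'good']
print(counts.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]
```

Each row is a document, each column a vocabulary word (sorted alphabetically), and each cell the raw count.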

TF-IDF (TFIDF Vectorizer): It stands for "term frequency–inverse document frequency".

It's a word-frequency score that tries to highlight words that are more interesting: frequent within a document but rare across documents. The importance is on a scale of 0 to 1.

Term Frequency (TF): This summarizes how often a given word appears in a document.

Sent 1: Good Boy

Sent 2: Good Girl

Sent 3: Boy Girl Good

TF = (Number of times the word appears in a sentence) / (Total number of words in the sentence)

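The TF formula above can be computed directly for the example sentences:

```python
def tf(word, sentence):
    # TF = count of the word in the sentence / total words in the sentence.
    tokens = sentence.lower().split()
    return tokens.count(word.lower()) / len(tokens)

print(tf("good", "Good Boy"))        # 0.5
print(tf("good", "Boy Girl Good"))   # 0.3333...
```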

Inverse Document Frequency (IDF): This downscales words that appear across many documents in the corpus.

IDF = log(Number of sentences / Number of sentences containing the word)

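Applying the IDF formula to the same three sentences: "good" occurs in all three, so its IDF is log(3/3) = 0, while "boy" occurs in two, giving log(3/2):

```python
import math

sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

def idf(word, sentences):
    # IDF = log(total sentences / sentences containing the word).
    n_containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / n_containing)

print(idf("good", sentences))  # 0.0
print(idf("boy", sentences))   # log(1.5) ≈ 0.405
```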

The final score is the product of the two: TF-IDF = TF × IDF.

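Combining the two formulas above for the example sentences (note that scikit-learn's TfidfVectorizer uses a smoothed IDF variant, so its numbers would differ slightly):

```python
import math

sentences = ["Good Boy", "Good Girl", "Boy Girl Good"]
tokenized = [s.lower().split() for s in sentences]

def tf_idf(word, tokens, corpus):
    # TF-IDF = TF * IDF, using the formulas defined above.
    tf = tokens.count(word) / len(tokens)
    n_containing = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

# "good" appears in every sentence, so its score is 0; "boy" is rarer.
print(round(tf_idf("good", tokenized[0], tokenized), 3))  # 0.0
print(round(tf_idf("boy", tokenized[0], tokenized), 3))   # 0.203
```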

Hashing Vectorizer: Converts a collection of text documents to a matrix of token occurrences. This vectorizer implementation uses the hashing trick to map token strings to feature integer indices, so no vocabulary needs to be stored in memory.
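With scikit-learn's HashingVectorizer (assuming that is the implementation in question), each token is hashed into one of a fixed number of buckets; the n_features value below is deliberately tiny for illustration, and distinct words can collide into the same bucket:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["Good Boy", "Good Girl", "Boy Girl Good"]
# norm=None and alternate_sign=False keep the cells as raw counts.
vec = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
X = vec.fit_transform(docs)

print(X.shape)  # (3, 8) — 3 documents, 8 hash buckets
```

The trade-off: no vocabulary_ to inspect or invert, but constant memory regardless of corpus size.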

Thank you for viewing my first post. Please provide your suggestions and feedback. I will surely try to keep this post updated, and it would encourage me to share a few more posts.
