UNIT-1 || Natural Language Processing || Notes -By Sandeep Kadu ||

  UNIT : 1

NLP :

  • Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. 

  • It is closely tied to machine learning, which supplies the techniques for analyzing text and performing predictive analysis on it.

  • Scikit-learn is a free software machine learning library for Python programming language. Scikit-learn is largely written in Python, with some core algorithms written in Cython to achieve performance.

  • Cython is a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python.

  • Natural Language Processing (NLP) refers to the AI method of communicating with intelligent systems using a natural language such as English.

  • Processing of natural language is required when you want an intelligent system such as a robot to perform as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system, and so on.

The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be −

  • Speech

  • Written Text

Components of NLP

There are two components of NLP as given −

Natural Language Understanding (NLU)

Understanding involves the following tasks −

  • Mapping the given input in natural language into useful representations.

  • Analyzing different aspects of the language.

Natural Language Generation (NLG)

It is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.

It involves −

  • Text planning − It includes retrieving the relevant content from the knowledge base.

  • Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.

  • Text Realization − It is mapping the sentence plan into sentence structure.

NLU is harder than NLG.

Difficulties in NLU

NL has an extremely rich form and structure.

It is very ambiguous. There can be different levels of ambiguity −

  • Lexical ambiguity − It is at a very primitive level, such as the word level.

  • For example, should the word “board” be treated as a noun or a verb?

  • Syntax Level ambiguity − A sentence can be parsed in different ways.

  • For example, “He lifted the beetle with red cap.” − Did he use the cap to lift the beetle, or did he lift a beetle that had a red cap?

  • Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?

  • One input can have several different meanings.

  • Several different inputs can mean the same thing.

NLP Terminology

  • Phonology − It is the study of organizing sound systematically.

  • Morphology − It is the study of the construction of words from primitive meaningful units.

  • Morpheme − It is the primitive unit of meaning in a language.

  • Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.

  • Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.

  • Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.

  • Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.

  • World Knowledge − It includes the general knowledge about the world.

Steps in NLP

There are five general steps −

  • Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language means the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.

  • Syntactic Analysis (Parsing) − It involves analysis of the words in the sentence for grammar, and arranging the words in a manner that shows the relationships among them. A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.


  • Semantic Analysis − It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness. This is done by mapping syntactic structures onto objects in the task domain. The semantic analyzer disregards sentences such as “hot ice-cream”.

  • Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also bears on the meaning of the immediately succeeding sentence.

  • Pragmatic Analysis − During this step, what was said is re-interpreted in terms of what was actually meant. It involves deriving those aspects of language which require real-world knowledge.

Implementation Aspects of Syntactic Analysis

There are a number of algorithms researchers have developed for syntactic analysis, but we consider only the following simple methods −

  • Context-Free Grammar

  • Top-Down Parser

Let us see them in detail −

Context-Free Grammar

It is the grammar that consists of rules with a single symbol on the left-hand side of each rewrite rule. Let us create a grammar to parse the sentence −

“The bird pecks the grains”

Articles (DET) − a | an | the

Nouns − bird | birds | grain | grains

Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun

= DET N | DET ADJ N

Verbs − pecks | pecking | pecked

Verb Phrase (VP) − NP V | V NP

Adjectives (ADJ) − beautiful | small | chirping

The parse tree breaks down the sentence into structured parts so that the computer can easily understand and process it. In order for the parsing algorithm to construct this parse tree, a set of rewrite rules, which describe what tree structures are legal, needs to be constructed.

These rules say that a certain symbol may be expanded in the tree by a sequence of other symbols. According to the first rewrite rule, if there are two strings Noun Phrase (NP) and Verb Phrase (VP), then the string formed by NP followed by VP is a sentence. The rewrite rules for the sentence are as follows −

S → NP VP

NP → DET N | DET ADJ N

VP → V NP

Lexicon −

DET → a | the

ADJ → beautiful | perching

N → bird | birds | grain | grains

V → peck | pecks | pecking

The parse tree can be created as shown below.

[Figure: parse tree for “The bird pecks the grains”]

Now consider the above rewrite rules. Since V can be rewritten as either "peck" or "pecks", a sentence such as "The bird peck the grains" is wrongly permitted, i.e. the subject-verb agreement error is accepted as correct.

Merit − It is the simplest style of grammar and is therefore a widely used one.

Demerits −

  • They are not highly precise. For example, “The grains peck the bird” is syntactically correct according to the parser; even though it makes no sense, the parser accepts it as a valid sentence.

  • To bring out high precision, multiple sets of grammar need to be prepared. It may require completely different sets of rules for parsing singular and plural variations, passive sentences, etc., which can lead to the creation of a huge, unmanageable set of rules.

Top-Down Parser

Here, the parser starts with the S symbol and attempts to rewrite it into a sequence of terminal symbols that matches the classes of the words in the input sentence until it consists entirely of terminal symbols.

These are then checked against the input sentence to see if they match. If not, the process starts over again with a different set of rules. This is repeated until a specific rule is found which describes the structure of the sentence.

Merit − It is simple to implement.

Demerits −

  • It is inefficient, as the search process has to be repeated if an error occurs.

  • It works slowly.
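To make the procedure concrete, here is a minimal sketch of a top-down (recursive-descent) parser in Python for the toy grammar defined above. The table and function names are illustrative, not part of any library, and a complete parse would additionally require that every input word is consumed:

```python
# Grammar and lexicon from this section:
# S -> NP VP;  NP -> DET N | DET ADJ N;  VP -> V NP
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N":   {"bird", "birds", "grain", "grains"},
    "V":   {"peck", "pecks", "pecking"},
}

def parse(symbol, words, pos):
    """Try to expand `symbol` starting at words[pos].
    Returns (tree, next_pos) on success, or None on failure."""
    if symbol in LEXICON:                    # terminal category
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            return (symbol, words[pos]), pos + 1
        return None
    for rule in GRAMMAR[symbol]:             # try each rewrite rule in turn
        children, p = [], pos
        for sym in rule:
            result = parse(sym, words, p)
            if result is None:
                break
            child, p = result
            children.append(child)
        else:                                 # the whole rule matched
            return (symbol, children), p
    return None

tree, end = parse("S", "the bird pecks the grains".split(), 0)
print(tree)
```

If a rule fails, the parser falls back to the next rewrite rule for the same symbol, mirroring the "start over with a different set of rules" behaviour described above.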


Lexical Processing :

What are the different levels of NLP? How do these integrate with Information Retrieval?


  • Natural Language Processing works on multiple levels and, most often, these different areas synergize well with each other. This article offers a brief overview of each level and provides some examples of how they are used in information retrieval.

Morphological

  • The morphological level of linguistic processing deals with the study of word structures and word formation, focusing on the analysis of the individual components of words. 

  • The most important unit of morphology, defined as having the “minimal unit of meaning”, is referred to as the morpheme.

  • Take, for example, the word “unhappiness”. It can be broken down into three morphemes (prefix, stem, and suffix), with each conveying some form of meaning: the prefix un- refers to “not being”, while the suffix -ness refers to “a state of being”.

  • The stem happy is considered a free morpheme since it is a “word” in its own right. Bound morphemes (prefixes and suffixes) require a free morpheme to which they can attach, and therefore cannot appear as “words” on their own.

  • In Information Retrieval, document and query terms can be stemmed to match the morphological variants of terms between the documents and query; such that the singular form of a noun in a query will match even with its plural form in the document, and vice versa, thereby increasing recall.
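As an illustration of this kind of stemming, here is a naive suffix-stripping sketch in Python. The suffix list is invented for the example; real systems use proper stemmers such as NLTK's PorterStemmer:

```python
# Naive suffix-stripping stemmer: strips one common suffix if the
# remaining stem stays reasonably long. For illustration only.
SUFFIXES = ["ness", "ing", "ed", "es", "s"]

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# singular and plural forms now map to the same index term
print(naive_stem("grains"))   # grain
print(naive_stem("grain"))    # grain
print(naive_stem("pecked"))   # peck
```

Because "grains" and "grain" reduce to the same stem, a query containing one will match documents containing the other, which is exactly the recall improvement described above.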

Lexical

  • The lexical analysis in NLP deals with the study at the level of words with respect to their lexical meaning and part-of-speech. 

  • This level of linguistic processing utilizes a language’s lexicon, which is a collection of individual lexemes. A lexeme is a basic unit of lexical meaning: an abstract unit of morphological analysis that represents the set of forms, or “senses”, taken by a single morpheme.

  • “Duck”, for example, can take the form of a noun or a verb but its part-of-speech and lexical meaning can only be derived in context with other words used in the phrase/sentence. 

  • This, in fact, is an early step towards a more sophisticated Information Retrieval system where precision is improved through part-of-speech tagging.

Syntactic

  • The part-of-speech tagging output of the lexical analysis can be used at the syntactic level of linguistic processing to group words into phrase and clause brackets. 

  • Syntactic analysis, also referred to as “parsing”, allows the extraction of phrases which convey more meaning than just the individual words by themselves, such as a noun phrase.

  • In Information Retrieval, parsing can be leveraged to improve indexing since phrases can be used as representations of documents which provide better information than just single-word indices. 

  • In the same way, phrases that are syntactically derived from the query offer better search keys to match with documents that are similarly parsed.

  • Nevertheless, syntax can still be ambiguous at times as in the case of the news headline: “Boy paralyzed after tumor fights back to gain black belt” — which actually refers to how a boy was paralyzed because of a tumor but endured the fight against the disease and ultimately gained a high level of competence in martial arts.

Semantic

  • The semantic level of linguistic processing deals with the determination of what a sentence really means by relating syntactic features and disambiguating words with multiple definitions to the given context. 

  • This level entails the appropriate interpretation of the meaning of sentences, rather than analysis at the level of individual words or phrases.

  • In Information Retrieval, the query and document matching process can be performed on a conceptual level, as opposed to simple terms, thereby further increasing system precision. 

  • Moreover, by applying semantic analysis to the query, term expansion becomes possible with the use of lexical sources, offering improved retrieval of the relevant documents even if the exact terms are not used in the query. Recall generally increases with query expansion, and precision may increase as well.

Discourse

  • The discourse level of linguistic processing deals with the analysis of structure and meaning of text beyond a single sentence, making connections between words and sentences. At this level, Anaphora Resolution is also achieved by identifying the entity referenced by an anaphor (most commonly in the form of, but not limited to, a pronoun). An example is shown below.

Fig 1. Anaphora Resolution Illustration

  • With the capability to recognize and resolve anaphoric relationships, document and query representations are improved: at the lexical level, the implicit presence of concepts is accounted for throughout the document as well as in the query, while at the semantic and discourse levels, an integrated content representation of the documents and queries is generated.

  • Structured documents also benefit from the analysis at the discourse level since sections can be broken down into (1) title, (2) abstract, (3) introduction, (4) body, (5) results, (6) analysis, (7) conclusion, and (8) references.

  • Information Retrieval systems are significantly improved when the specific role of each piece of information is determined: whether it is a conclusion, an opinion, a prediction, or a fact.

Pragmatic

  • The pragmatic level of linguistic processing deals with the use of real-world knowledge and understanding how this impacts the meaning of what is being communicated. By analyzing the contextual dimension of the documents and queries, a more detailed representation is derived.

  • In Information Retrieval, this level of Natural Language Processing primarily engages query processing and understanding by integrating the user’s history and goals as well as the context upon which the query is being made. Contexts may include time and location.

  • This level of analysis enables major breakthroughs in Information Retrieval as it facilitates the conversation between the IR system and the users, allowing the elicitation of the purpose upon which the information being sought is planned to be used, thereby ensuring that the information retrieval system is fit for purpose.

Regular Expressions — An excellent tool for text analysis or NLP

  • Regular expressions, also called regexes, are a fascinating programming tool available in most programming languages. They are very powerful and are used for a variety of purposes such as feature extraction from text, string replacement, and other string manipulations.

  • A regular expression is a set of characters, or a pattern, which is used to find substrings in a given string: for example, extracting all hashtags from a tweet, or pulling email IDs or phone numbers out of a large piece of unstructured text.

  • In short, if there’s a pattern in any string, you can easily extract, substitute, and perform a variety of other string manipulation operations using regular expressions. Regular expressions form a small language in their own right: they have their own compilers, and almost all popular programming languages support working with them.

  • It is no exaggeration to say that without an understanding of regular expressions it is hard to build an NLP-based system such as a chatbot or a conversational UI. Regex has many elements and features that help to build useful, fit-for-purpose expressions for string extraction and manipulation.

Let’s look at each of these in the next few sections.

Common Regex Functions

  1. re.search() — detects whether the given regular expression pattern is present in the given input string. It returns a Match object if the pattern is found in the string; otherwise it returns None.

  • re.search(pattern, text)

Within search functions, different flags can be used to perform specific operations. For example

  • ‘re.I’ — to ignore the case of the text

  • re.M — enables to search in multiple lines

  • re.search(pattern, string, flags=re.I | re.M)

2. match.start() and match.end() — Return the indices of the starting and ending positions of the match found, respectively.

3. re.match() — The match function will only match if the pattern is present at the very start of the string.

  • re.match(pattern, text)

4. re.sub() — It is used to substitute a substring with another substring, e.g. replacing ‘color’ with ‘colour’.

  • re.sub(pattern, replacement, text)

5. findall() and finditer() — findall() returns a list of all the matches, while finditer() returns an iterator that can be used in a ‘for’ loop to step through each separate match one by one.

6. re.compile() — This function compiles a regular expression pattern into a pattern object, which can be reused for faster repeated searches. You pass the regex pattern to the re.compile() function.

# without re.compile()
result = re.search("a+", "abc")

# using re.compile()
pattern = re.compile("a+")
result = pattern.search("abc")
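The functions above can be exercised together in a short sketch (the strings are made up for illustration):

```python
import re

text = "color colour colouring"

# re.search: first match anywhere in the string
m = re.search(r"colou?r", text)
print(m.group(), m.start(), m.end())       # color 0 5

# re.match: only matches at the very start of the string
print(re.match(r"colour", text))           # None, since "color" comes first

# re.sub: substitute every match
print(re.sub(r"colou?r", "color", text))   # color color coloring

# findall / finditer: all matches at once, or one by one
print(re.findall(r"colou?r", text))        # ['color', 'colour', 'colour']
for m in re.finditer(r"colou?r", text):
    print(m.start(), m.group())
```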


Quantifiers — Allow you to specify and control how many times a specific character or pattern should occur in the given text. For example, suppose some data contains variations of the word ‘Program’, such as ‘Program’, ‘Programmed’ and ‘Programmer’, and the requirement is to find only those variants in which an ‘e’ occurs. You could use any of the following quantifiers to indicate the number of occurrences of a specific character.

  • ‘?’ — matches the preceding character zero or one time. It is generally used to mark the optional presence of a character.

  • ‘*’ — quantifier is used to mark the presence of the preceding character zero or more times.

  • ‘+’ — quantifier matches the preceding character one or more times. That means the preceding character has to be present at least once for the pattern to match the string.

  • {m,n} quantifier — There are four variants of this quantifier

  • {m, n}: Matches the preceding character ‘m’ times to ’n’ times.

  • {m, }: Matches the preceding character ‘m’ times to infinite times, i.e. there is no upper limit to the occurrence of the preceding character.

  • {, n}: Matches the preceding character from zero to ’n’ times, i.e. the upper limit is fixed regarding the occurrence of the preceding character.

  • {n}: Matches if the preceding character occurs exactly ’n’ number of times.

Each of the earlier mentioned quantifiers can be written in the form of {m,n} quantifier —

  • ‘?’ is equivalent to zero or once, or {0, 1}

  • ‘*’ is equivalent to zero or more times, or {0, }

  • ‘+’ is equivalent to one or more times, or {1, }
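The quantifiers above can be checked quickly in Python; re.fullmatch requires the whole string to match the pattern, and the patterns here are invented for illustration:

```python
import re

# '?' : zero or one occurrence of the preceding character
print(bool(re.fullmatch(r"colou?r", "color")))    # True
print(bool(re.fullmatch(r"colou?r", "colour")))   # True

# '*' : zero or more occurrences
print(bool(re.fullmatch(r"ab*c", "ac")))          # True
print(bool(re.fullmatch(r"ab*c", "abbbc")))       # True

# '+' : one or more occurrences
print(bool(re.fullmatch(r"ab+c", "ac")))          # False
print(bool(re.fullmatch(r"ab+c", "abc")))         # True

# '{m,n}' : between m and n occurrences
print(bool(re.fullmatch(r"ab{2,3}c", "abc")))     # False
print(bool(re.fullmatch(r"ab{2,3}c", "abbc")))    # True
```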

Whitespace — A whitespace is a single space, multiple spaces, a tab, or a newline character (also known as a vertical space). These whitespaces will match the corresponding spaces in the string. For example, the pattern ‘ +’, i.e. a space followed by a plus sign, will match one or more spaces. Similarly, you can use spaces together with other characters inside a pattern: the pattern ‘Sachin Tendulkar’ will allow you to look for the name ‘Sachin Tendulkar’ in any given string.

Parentheses — Using parentheses around some characters lets you repeat a group of characters rather than just looking for repetitions of a single preceding character. This concept is called grouping in regular expressions. For example, the pattern ‘(xyz){1,3}’ will match the following strings:

  • xyz

  • xyzxyz

  • xyzxyzxyz

Use of parentheses is also helpful when you need to extract sub-patterns out of a larger pattern. For example, suppose you have textual data with dates in it and you want to extract only the year from the dates. You can use a regular expression pattern with grouping to match dates and then extract any of the elements, such as the day, the month, or the year. Grouping is achieved using the parenthesis operators. Take the date “10/03/1985”. To extract the date from a string you can use the pattern “\d{1,2}/\d{1,2}/\d{4}”. Now, to extract the year, you put parentheses around the year part of the pattern: “^\d{1,2}/\d{1,2}/(\d{4})$”.
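A short Python sketch of this grouping idea, using the same date but an invented surrounding sentence:

```python
import re

text = "The match was played on 10/03/1985 in Mumbai."

# the whole pattern matches the date; parentheses capture the year
m = re.search(r"\d{1,2}/\d{1,2}/(\d{4})", text)
print(m.group())    # 10/03/1985  (the entire date)
print(m.group(1))   # 1985        (the captured year)
```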

Pipe operator — It’s denoted by ‘|’. The pipe operator is used as an OR operator. You need to use it inside the parentheses. For example, the pattern ‘(play|walk)ed’ will match both the strings — ‘played’ and ‘walked’. The pipe operator tells that the place inside the parentheses can be either of the strings or characters.

Special Characters — The characters mentioned as quantifiers above, such as ‘?’, ‘*’, ‘+’, ‘(’, ‘)’, ‘{’, etc., can also appear in the input text. In such cases, to match these specific characters you need to use escape sequences. The escape sequence, denoted by a backslash ‘\’, is used to escape the special meaning of the special characters. To match a question mark you need to use ‘\?’ (this is called escaping the character), and to match a ‘+’ sign you need to write ‘\+’ in the regular expression. The ‘\’ itself is a special character, so to match a literal ‘\’ you need to use the pattern ‘\\’ to escape the backslash.

Anchors — These are special characters to perform string operations at the beginning and end of a text input.

  • Caret character ‘^’ — specifies the start of the string. The character that follows the ‘^’ in the pattern should be the first character of the string in order for the string to match the pattern.

  • Dollar character ‘$’ — specifies the end of the string. The character that precedes the ‘$’ in the pattern should be the last character in the string in order for the string to match the pattern.

Both anchors can be specified in a single regular expression. For example, the pattern ‘^01+2$’ will match any string that starts with 0, has one or more 1s in the middle, and ends with 2. It will match ‘012’, ‘0112’, ‘01111111112’, etc.
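Checking the anchored pattern against a few strings in Python:

```python
import re

pattern = r"^01+2$"   # starts with 0, one or more 1s, ends with 2

for s in ["012", "0112", "01111111112", "3012", "0123"]:
    print(s, bool(re.search(pattern, s)))
# the first three print True; "3012" and "0123" print False
```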

Wildcard Character — The ‘.’ (dot) character is also called the wildcard character. It acts as a placeholder and can match any character. For example, suppose you want strings that start with any four characters, followed by three 0s and two 1s, followed by any two characters. Valid strings include cghj00011op, kghj00011pk, etc. A pattern that satisfies this condition is ‘.{4}0{3}1{2}.{2}’. You can also write ‘....00011..’, where each dot acts as a placeholder, meaning anything can sit in its place.

Character Sets — Character sets provide a lot more flexibility than just typing a wildcard or literal characters. They are groups of characters specified inside square brackets. Character sets can be specified with or without a quantifier. When no quantifier follows the character set, it matches exactly one character, and the match succeeds only if the character in the string is one of the characters present inside the character set. For example, the pattern ‘[a-z]ed’ will match strings such as ‘ted’, ‘bed’ and ‘red’, because the first character of each string (‘t’, ‘b’ and ‘r’) is present inside the range of the character set.

On the other hand, when we use a character set with a quantifier, as in ‘[a-z]+ed’, it will match any word that ends with ‘ed’, such as ‘watched’, ‘baked’, ‘jammed’, ‘educated’ and so on. Note that a quantifier loses its special meaning and is treated as a literal character when it is present inside a character set.
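The two behaviours can be seen directly in Python:

```python
import re

# without a quantifier, the set matches exactly one character
print(re.findall(r"[a-z]ed", "ted bed red"))
# ['ted', 'bed', 'red']

# with '+', the set matches a run of characters ending in 'ed'
print(re.findall(r"[a-z]+ed\b", "watched baked jammed"))
# ['watched', 'baked', 'jammed']
```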

The ‘^’ has two use cases. Outside a character set it works as an anchor, but when used inside a character set it acts as a complement operator, i.e. it specifies that the pattern will match any character other than the ones mentioned inside the character set. For example, the pattern ‘[0-9]’ matches any single digit, while the pattern ‘[^0-9]’ matches any single character that is not a digit.


Meta Sequences — These are basically a shorthand way to write commonly used character sets in regular expressions. The commonly used meta-sequences are ‘\d’ (any digit, equivalent to [0-9]), ‘\D’ (any non-digit), ‘\w’ (any alphanumeric character or underscore, equivalent to [a-zA-Z0-9_]), ‘\W’ (any non-word character), ‘\s’ (any whitespace character) and ‘\S’ (any non-whitespace character).


You can use meta-sequences in two ways. You can use them without square brackets: for example, the pattern ‘\w+’ will match one or more alphanumeric characters (or underscores). Or you can write them inside square brackets: for example, the pattern ‘[\w]+’ is the same as ‘\w+’.

In summary, regular expressions are a handy tool with many practical applications, and they are fundamental to implementing NLP. Beyond NLP, regular expressions can be used for various other tasks, such as checking whether input given by a user in a form meets the minimum criteria, among many other possibilities.

GREEDY DECODER

  • This is the most straightforward approach: at every step we select the word that has the highest probability (i.e. we act greedily). While it can generate a sequence of words, the quality of the output is often low when compared to the other decoding algorithms.

  • [Figure: greedy decoding visualisation; the vocabulary contains 36,780 words, of which only the top 60 are shown at each timestep.]

BEAM SEARCH DECODER

  • In the greedy decoder, we considered a single word at every step. What if we could track multiple words at every step and use those to generate multiple hypotheses?

  • This is exactly what the beam search algorithm does: we define how many words (k) we want to keep at every step. The algorithm keeps track of k words along with their scores, each seeded from the previous top-scoring k words. The score of a hypothesis is the sum of the log probabilities of the words generated so far:

  • score(y₁, …, y_t) = Σᵢ log P(yᵢ | y₁, …, yᵢ₋₁, x), summed over i = 1 … t, where t is the step, x is the input image and y are the words generated. The stopping criteria remain the same as in greedy search: a hypothesis stops as soon as we encounter <STOP> or run out of a predefined maximum number of steps. The end result is a tree of words (i.e. multiple hypotheses), and we pick the one that has the highest score as our final solution.

  • When we use k=1, it works just like the greedy decoder and suffers from the same issue of producing low-quality output. As we increase k, the algorithm starts to generate better-quality output, although at larger k the output gets very short. Also note that increasing k is compute-intensive, because we need to keep track of k words at every step.
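The idea can be sketched with a toy next-word model in Python. The probability table below is completely made up for illustration; a real system would query a trained decoder at each step:

```python
import math

# Toy conditional model: given the sequence so far, return a
# {next_word: probability} table. These numbers are invented.
MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"bird": 0.7, "cat": 0.3},
    ("a",): {"bird": 0.5, "cat": 0.5},
}
DEFAULT = {"<STOP>": 1.0}   # unseen contexts simply stop

def beam_search(k, max_steps=5):
    # each hypothesis is (sequence, cumulative log-probability)
    beams = [((), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<STOP>":
                candidates.append((seq, score))  # finished hypothesis
                continue
            for word, p in MODEL.get(seq, DEFAULT).items():
                candidates.append((seq + (word,), score + math.log(p)))
        # keep only the k best-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]

best_seq, best_score = beam_search(k=2)
print(best_seq)   # ('the', 'bird', '<STOP>')
```

With k=1 the loop keeps only the single best word at each step, which is exactly the greedy decoder described earlier.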




Greedy Matching And Non-Greedy Matching

  • The usual rule for matching in REs is sometimes called “left-most longest”: when a pattern can be matched at more than one place within a string, the chosen match is the one that starts at the earliest possible position within the string and then extends as far as possible.

  • Normally Perl pattern matching is greedy. By greedy, we mean that the parser tries to match as much as possible. In the string “abcbcbcde”, for example, the pattern /(bc)+/ can match in six different ways, as shown in the figure:

[Figure: the six possible matches of /(bc)+/ in “abcbcbcde”]

  • In the figure, the third of these matched patterns is “left-most longest”, also known as greedy.

  • In some cases, however, it may be desirable to obtain a “left-most shortest” or minimal match. We can turn a greedy match into a non-greedy match by adding ‘?’ after the quantifier: ‘*?’ matches the smallest number of instances of the preceding subexpression that will allow the overall match to succeed. Similarly, ‘+?’ matches at least one instance, but no more than necessary to allow the overall match to succeed, and ‘??’ matches either zero or one instance, with a preference for zero.
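The same contrast can be shown in Python, which uses identical greedy and non-greedy operators:

```python
import re

s = "abcbcbcde"

# greedy: '(bc)+' grabs as many 'bc' repetitions as possible
print(re.search(r"(bc)+", s).group())    # bcbcbc

# non-greedy: '(bc)+?' stops after the first 'bc'
print(re.search(r"(bc)+?", s).group())   # bc
```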

Example: Greedy pattern matching


#!/usr/bin/perl

# Perl program to show greedy matching
$var = "Geeks For Geeks";

# Greedy match from the first 'k' to the last 's'
$var =~ /(k.*s)(.*)$/;

# Printing the resultant string
print($1, "\n");

Here we can see that the match starts at the first ‘k’ and extends as far as possible, to the last ‘s’, so $1 holds “ks For Geeks”.

Example: Non-greedy pattern matching


#!/usr/bin/perl

# Perl program to show non-greedy matching
my $str = "Geeks For Geeks";

# '.*?' matches as little as possible before the first whitespace
$str =~ /.*?\s(.*)/;

# Printing the resultant string
print($1);

With the non-greedy operator, the match consumes as little of the string as possible: $1 here holds “For Geeks”.



RE Functions

Source: https://www.w3schools.com/python/python_regex.asp

RegEx Functions

The re module offers a set of functions that allows us to search a string for a match:

  • findall — Returns a list containing all matches

  • search — Returns a Match object if there is a match anywhere in the string

  • split — Returns a list where the string has been split at each match

  • sub — Replaces one or many matches with a string


Regular Expressions: Grouping

Grouping

We group part of a regular expression by surrounding it with parentheses. This is how we apply operators to the complete group instead of a single character.

Capturing Groups

Parentheses not only group sub-expressions, they also create backreferences. The part of the string matched by the grouped part of the regular expression is stored in a backreference. With the help of backreferences, we can reuse parts of regular expressions.

In practical applications, we often need regular expressions that can match any one of two or more alternatives. Also, we sometimes want a quantifier to apply to several expressions. All of these can be achieved by grouping with parentheses; and, using alternation with the vertical bar (|).

Alternation is useful when we want to match any one of several different alternatives. For example, the regex aircraft|airplane|jet will match any text that contains aircraft or airplane or jet. The same objective can be achieved using the regex air(craft|plane)|jet. 
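Both forms behave identically, as a quick Python check shows (the test words are invented):

```python
import re

for t in ["aircraft", "airplane", "jet", "airship"]:
    a = bool(re.search(r"aircraft|airplane|jet", t))
    b = bool(re.search(r"air(craft|plane)|jet", t))
    print(t, a, b)   # the two patterns always agree; "airship" matches neither
```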

Example

import re

s = 'Tahiti $% Tahiti *&^ 34 Atoll'
result = re.findall(r'(\w+)', s)
print(result)

Output

This gives the output

['Tahiti', 'Tahiti', '34', 'Atoll']


Groups and Grouping using Regular Expressions

  • Suppose you are validating email addresses and want to check the user name and host separately.

  • This is when the group feature of regular expression comes in handy. It allows you to pick up parts of the matching text.

  • Parts of a regular expression pattern bounded by parentheses () are called groups. The parentheses do not change what the expression matches, but rather form groups within the matched sequence. You have been using the group() function all along in this tutorial’s examples. The plain match.group() without any argument is still the whole matched text, as usual.
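A sketch of the user-name/host idea; the pattern below is a deliberately simplified, hypothetical email matcher (real address validation is far more involved):

```python
import re

# capture group 1 = user name, group 2 = host
m = re.match(r"([\w.]+)@([\w.]+)", "sandeep.kadu@example.com")
print(m.group())    # sandeep.kadu@example.com  (whole match)
print(m.group(1))   # sandeep.kadu              (user name)
print(m.group(2))   # example.com               (host)
```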

Regular Expressions: Use Cases

  • Examining command lines

  • Parsing user input

  • Parsing various text files

  • Examining web server logs

  • Examining test results

  • Finding text in emails

  • Reading configuration files

Basic Lexical Processing

  • In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. 

  • A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth. 


Word Frequencies and Stop Words


  • After tokenising a text, the first figure we can calculate is the word frequency. By word frequency we indicate the number of times each token occurs in a text.

  • When talking about word frequency, we distinguish between types and tokens: types are the distinct words in a corpus, whereas tokens are the running words, including repeats.
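The type/token distinction can be checked in a few lines of plain Python (a minimal sketch using collections.Counter; the sample sentence is made up):

```python
from collections import Counter

tokens = "the cat sat on the mat the end".split()
freq = Counter(tokens)          # word frequency: token -> count

print(len(tokens))              # number of tokens (running words): 8
print(len(freq))                # number of types (distinct words): 6
print(freq["the"])              # "the" occurs 3 times
```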

  • TF-IDF, or Term Frequency (TF) — Inverse Document Frequency (IDF), is a technique used to capture the importance of the words in sentences and documents. It makes up for a shortcoming of the Bag of Words technique, which is good for text classification or for helping a machine read words as numbers, but fails when you ask it to weigh the meaning of words within a sentence or document.

  • TF = (Number of repetitions of the word in a document) / (Number of words in the document)

  • IDF = log[(Number of documents) / (Number of documents containing the word)]

TF-IDF Vectorizer

TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

  • Term Frequency (TF): is a scoring of the frequency of the word in the current document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. The term frequency is often divided by the document length to normalize.


  • Inverse Document Frequency (IDF): is a scoring of how rare the word is across documents. The rarer the term, the higher the IDF score.
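The two formulas can be computed directly on a toy corpus (a minimal sketch; the three one-line "documents" are made up for illustration):

```python
import math

docs = [
    "this pasta is tasty".split(),
    "this pasta is affordable".split(),
    "pasta is pasta".split(),
]

word = "pasta"

# TF = (repetitions of the word in a document) / (words in the document)
tf_doc2 = docs[2].count(word) / len(docs[2])      # 2 / 3

# IDF = log[(number of documents) / (documents containing the word)]
n_containing = sum(1 for d in docs if word in d)
idf = math.log(len(docs) / n_containing)          # log(3 / 3) = 0.0

print(tf_doc2, idf)
```

Because "pasta" appears in every document, its IDF is 0: exactly the damping of very common words described above.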



Removing stop words with NLTK in Python

The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words (data), are referred to as stop words.

What are Stop words?

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

We would not want these words taking up space in our database or taking up valuable processing time. For this, we can remove them easily by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has a list of stopwords stored in 16 different languages. You can find them in the nltk_data directory, e.g. home/pratima/nltk_data/corpora/stopwords (do not forget to change the home directory name to your own).

Stop word removal using NLTK

To check the list of stopwords you can type the following commands in the python shell.

import nltk

nltk.download('stopwords')   # download the stopwords corpus once, if not already present

from nltk.corpus import stopwords

set(stopwords.words('english'))

{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’}

Note: You can even modify the list by adding words of your choice to the english file in the stopwords directory.
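Stop-word removal itself is then a simple filter over the tokens. The sketch below uses a small hard-coded stop-word list so that it runs without downloading the NLTK corpus; with NLTK you would use set(stopwords.words('english')) instead (the sample sentence is made up):

```python
# A tiny stand-in for NLTK's English stop-word list (illustrative subset only)
stop_words = {"the", "a", "an", "in", "is", "to", "of"}

sentence = "the quick brown fox is jumping in the garden"
tokens = sentence.split()

# Keep only the tokens that are not stop words
filtered = [w for w in tokens if w not in stop_words]
print(filtered)   # ['quick', 'brown', 'fox', 'jumping', 'garden']
```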

Tokenization


NLP | How tokenizing text, sentence, words works

  • Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction.

  •  This field focuses on how to program computers to process and analyze large amounts of natural language data. It is difficult to perform as the process of reading and understanding languages is far more complex than it seems at first glance.

  • Tokenization is the process of splitting a string or text into a list of tokens. One can think of a token as a part of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.

Key points of the article – 

  • Tokenizing text into sentences

  • Tokenizing sentences into words

  • Tokenizing sentences using regular expressions


  • Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.


Code #1: Sentence Tokenization – Splitting sentences in the paragraph

from nltk.tokenize import sent_tokenize 


text = "Hello everyone. Welcome to Sandeep Tutorial. You are studying NLP article"

sent_tokenize(text) 


Output :

['Hello everyone.',

 'Welcome to Sandeep Tutorial.',

 'You are studying NLP article']


How sent_tokenize works?


The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which has already been trained and thus knows very well at which characters and punctuation marks a sentence begins and ends.

  

Code #2: PunktSentenceTokenizer – It is efficient to use when we have huge chunks of data. 


import nltk.data 


# Loading PunktSentenceTokenizer using English pickle file 

tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle') 


tokenizer.tokenize(text) 


Output :

['Hello everyone.',

 'Welcome to Sandeep Tutorial.',

 'You are studying NLP article']


Code #3: Tokenize sentences in a different language – One can also tokenize sentences in languages other than English by loading the corresponding pickle file. 

import nltk.data 


spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle') 


text = 'Hola amigo. Estoy bien.'

spanish_tokenizer.tokenize(text) 


Output :

['Hola amigo.', 

 'Estoy bien.']


Code #4: Word Tokenization – Splitting words in a sentence. 


from nltk.tokenize import word_tokenize 


text = "Hello everyone. Welcome to GeeksforGeeks."

word_tokenize(text) 


Output :

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']


How word_tokenize works?

word_tokenize() function is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class.

Code #5: Using TreebankWordTokenizer


from nltk.tokenize import TreebankWordTokenizer 


tokenizer = TreebankWordTokenizer() 

tokenizer.tokenize(text) 


Output :

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers work by separating the words using punctuation and spaces. And as the code outputs above show, they do not discard the punctuation, allowing the user to decide what to do with it at pre-processing time.

Code #6: PunktWordTokenizer – It doesn't separate the punctuation from the words. (Note: this class has been removed from recent versions of NLTK.) 


from nltk.tokenize import PunktWordTokenizer 


tokenizer = PunktWordTokenizer() 

tokenizer.tokenize("Let's see how it's working.") 


Output :

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']


Code #7: WordPunctTokenizer – It separates the punctuation from the words.


from nltk.tokenize import WordPunctTokenizer 


tokenizer = WordPunctTokenizer() 

tokenizer.tokenize("Let's see how it's working.") 

Output :

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']



Code #8: Using Regular Expression


from nltk.tokenize import RegexpTokenizer 


tokenizer = RegexpTokenizer(r"[\w']+") 

text = "Let's see how it's working."

tokenizer.tokenize(text) 


Output :

["Let's", 'see', 'how', "it's", 'working']


Code #9: Using Regular Expression

from nltk.tokenize import regexp_tokenize 


text = "Let's see how it's working."

regexp_tokenize(text, r"[\w']+") 


Output :

["Let's", 'see', 'how', "it's", 'working']


Bag-of-Words Representation


Popular and simple methods of feature extraction from text data currently in use are:

  1. Bag-of-Words

  2. TF-IDF

  3. Word2Vec


Bag Of Words (BOW):

  • The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

  • The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

  • The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

It involves two things:

  1. A vocabulary of known words: This step revolves around constructing a vocabulary which consists of all the unique words in the whole of the text present in the data provided. It is sort of like a dictionary where each index corresponds to one word, and each word is a different dimension.

Example: If we are given 4 reviews for an Italian pasta dish.

Review 1 : This pasta is very tasty and affordable.

Review 2: This pasta is not tasty and is affordable.

Review 3 : This pasta is delicious and cheap.

Review 4: Pasta is tasty and pasta tastes good.

Now if we count the number of unique words in all the four reviews we will be getting a total of 12 unique words. Below are the 12 unique words :

  1. ‘This’

  2. ‘pasta’

  3. ‘is’

  4. ‘very’

  5. ‘tasty’

  6. ‘and’

  7. ‘affordable’

  8. ‘not’

  9. ‘delicious’

  10. ‘cheap’

  11. ‘tastes’

  12. ‘good’

2. A measure of the presence of known words: Now if we take the first review and tabulate the count of each word, we get a table where row 1 corresponds to the index of the unique words and row 2 corresponds to the number of times each word occurs in that review (here, review 1).

We will make a sparse vector of d unique words, and for each document (review) we will fill it with the number of times the corresponding word occurs in that document.

Applying the Bag of Words model:

Step #1 : We will first preprocess the data, in order to:

  • Convert text to lower case.

  • Remove all non-word characters.

  • Remove all punctuations. 

You can further preprocess the text to suit your needs.

Step #2 : Obtaining most frequent words in our text.

We will apply the following steps to generate our model.

  • We declare a dictionary to hold our bag of words.

  • Next we tokenize each sentence into words.

  • Now, for each word in a sentence, we check whether the word exists in our dictionary.

  • If it does, we increment its count by 1. If it doesn't, we add it to our dictionary and set its count to 1. 

Step #3 : Building the Bag of Words model

In this step we construct a vector, which would tell us whether a word in each sentence is a frequent word or not. If a word in a sentence is a frequent word, we set it as 1, else we set it as 0.
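The steps above can be sketched end-to-end on the pasta reviews (a minimal sketch in plain Python; note that the vector here holds raw counts, as described under "A measure of the presence of known words", rather than the 0/1 frequent-word flags, which is the other common variant):

```python
import re

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def preprocess(text):
    # Step #1: lower-case and drop non-word characters / punctuation
    return re.sub(r"[^\w\s]", "", text.lower()).split()

# Step #2: build the vocabulary of unique words
vocab = []
for review in reviews:
    for word in preprocess(review):
        if word not in vocab:
            vocab.append(word)

# Step #3: one count vector per review over the whole vocabulary
vectors = [[preprocess(r).count(w) for w in vocab] for r in reviews]

print(len(vocab))    # 12 unique words, as counted above
print(vectors[0])    # counts for review 1
```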

Stemming and Lemmatization

Introduction to Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.

Some more example of stemming for root word "like" include:

->"likes"

->"liked"

->"likely"

->"liking"


Errors in Stemming:

There are mainly two errors in stemming – over-stemming and under-stemming. Over-stemming occurs when two words with different stems are stemmed to the same root. Under-stemming occurs when two words that should be stemmed to the same root are not.

Applications of stemming are:

  1.  Stemming is used in information retrieval systems like search engines.

  2. It is used to determine domain vocabularies in domain analysis.

Google search adopted word stemming in 2003. Previously a search for “cat” would not have returned “catty” or “cats”.

Some Stemming algorithms are:

  • Porter’s Stemmer algorithm
    It is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes.
    Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
    Advantage: It produces the best output as compared to other stemmers and it has less error rate.

          Limitation: Morphological variants produced are not always real words.

  • Lovins Stemmer
    It was proposed by Lovins in 1968. It removes the longest suffix from a word, then the word is recoded to convert this stem into a valid word.
    Example: sitting -> sitt -> sit
    Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth', etc.

           Limitation: It is time consuming and frequently fails to form words from the stem.

  • Dawson Stemmer
    It is an extension of the Lovins stemmer, in which suffixes are stored in reversed order, indexed by their length and last letter.
    Advantage: It is fast in execution and covers more suffixes.

 Limitation: It is very complex to implement.

  • Krovetz Stemmer
    It was proposed in 1993 by Robert Krovetz. It follows these steps:
    1) Convert the plural form of a word to its singular form.
    2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.
    Example: ‘children’ -> ‘child’
    Advantage: It is light in nature and can be used as a pre-stemmer for other stemmers.

    Limitation: It is inefficient in the case of large documents.

  • Xerox Stemmer
    Advantage: It works well in case of large documents and stems produced are valid.

 Limitation: It is language dependent, mainly implemented for English, and over-stemming may occur.

  • N-Gram Stemmer
    An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-grams in common.
    Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
    Advantage: It is based on string comparisons and it is largely language independent.

Limitation: It requires space to create and index the n-grams and it is not time efficient.
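The bigram example for ‘INTRODUCTIONS’ can be reproduced directly (a minimal sketch; '*' is used as the word-boundary padding character, as in the example above):

```python
def char_ngrams(word, n=2, pad="*"):
    # Pad the word so boundary n-grams like *I and S* are included
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("INTRODUCTIONS"))
# ['*I', 'IN', 'NT', 'TR', 'RO', 'OD', 'DU', 'UC', 'CT', 'TI', 'IO', 'ON', 'NS', 'S*']
```

The similarity of two words can then be scored by the proportion of n-grams they share.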

Python | Stemming words with NLTK

Prerequisite: Introduction to Stemming


Stemming is desirable as it may reduce redundancy as most of the time the word stem and their inflected/derived words mean the same.

Below is the implementation of stemming words using NLTK:

Code #1:

# import these modules 

from nltk.stem import PorterStemmer 

from nltk.tokenize import word_tokenize 


ps = PorterStemmer() 


# choose some words to be stemmed 

words = ["program", "programs", "programer", "programing", "programers"] 


for w in words: 

    print(w, " : ", ps.stem(w)) 


Output:

program  :  program

programs  :  program

programer  :  program

programing  :  program

programers  :  program


Code #2: Stemming words from sentences

# importing modules 

from nltk.stem import PorterStemmer 

from nltk.tokenize import word_tokenize 


ps = PorterStemmer() 


sentence = "Programers program with programing languages"

words = word_tokenize(sentence) 


for w in words: 

    print(w, " : ", ps.stem(w)) 


Output :

Programers  :  program

program  :  program

with  :  with

programing  :  program

languages  :  languag


What is Stemming?

  • Stemming is a kind of normalization for words. Normalization is a technique where a set of words in a sentence is converted into a standard form to shorten its lookup. Words which have the same meaning but vary according to the context or sentence are normalized.

  • In other words, there is one root word, but there are many variations of the same word. For example, the root word is "eat" and its variations are "eats, eating, eaten" and so on. In the same way, with the help of stemming, we can find the root word of any variation. 

For example

He was riding.

He was taking the ride.


  • In the above two sentences, the meaning is the same, i.e., riding activity in the past. A human can easily understand that both meanings are the same. But for machines, both sentences are different. 

  • Thus it becomes hard to convert them into the same data row. If we do not feed the machine the same form of data, it fails to predict correctly. 

  • So it is necessary to map the variants of each word to a common form when preparing the dataset for machine learning, and this is where stemming is used: it groups the same type of data by its root word. 

Conclusion:

  • Stemming is a data-preprocessing module. The English language has many variations of a single word. These variations create ambiguity in machine learning training and prediction. 

  • To create a successful model, it's vital to filter such words and convert them to the same type of sequenced data using stemming. This is also an important technique for removing redundant data from a set of sentences, a process also known as normalization. 


What is Lemmatization?

  • Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. 

  • Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word, which is known as the lemma. The NLTK lemmatization method is based on WordNet's built-in morphy function.

  • Text preprocessing includes both stemming and lemmatization. Many people find the two terms confusing; some treat them as the same, but there is a difference. Lemmatization is preferred over stemming for the reasons below. 

Why is Lemmatization better than Stemming?

  • A stemming algorithm works by cutting the suffix from the word. In a broader sense, it cuts either the beginning or the end of the word.

  • On the contrary, Lemmatization is a more powerful operation, and it takes into consideration morphological analysis of the words. It returns the lemma which is the base form of all its inflectional forms. 

  • In-depth linguistic knowledge is required to create the dictionaries and look up the proper form of the word. Stemming is a general operation, while lemmatization is an intelligent operation where the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features. 

Python | Lemmatization with NLTK


  • Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. 

  • Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word.

  • Text preprocessing includes both Stemming as well as Lemmatization. Many times people find these two terms confusing. Some treat these two as same. Actually, lemmatization is preferred over Stemming because lemmatization does morphological analysis of the words.

Applications of lemmatization are:

  • Used in comprehensive retrieval systems like search engines.

  • Used in compact indexing 

Examples of lemmatization:


-> rocks : rock

-> corpora : corpus

-> better : good

One major difference from stemming is that lemmatize() takes a part-of-speech parameter, “pos”. If not supplied, the default is “noun”.

Below is the implementation of lemmatization words using NLTK:

# import these modules 

from nltk.stem import WordNetLemmatizer 


lemmatizer = WordNetLemmatizer() 


print("rocks :", lemmatizer.lemmatize("rocks")) 

print("corpora :", lemmatizer.lemmatize("corpora")) 


# a denotes adjective in "pos" 

print("better :", lemmatizer.lemmatize("better", pos ="a")) 


Output :

rocks : rock

corpora : corpus

better : good

Introduction: TF-IDF

  • TF-IDF stands for “Term Frequency — Inverse Document Frequency”. This is a technique to quantify the words in documents: we compute a weight for each word which signifies its importance in the document and corpus. It is a widely used technique in Information Retrieval and Text Mining.

  • If I give you a sentence, for example “This building is so tall”, it's easy for us to understand it as we know the semantics of the words and the sentence. But how will the computer understand this sentence? The computer can understand data only in the form of numerical values. So, for this reason, we vectorize all of the text so that the computer can understand it better.

  • By vectorizing the documents we can further perform multiple tasks such as finding the relevant documents, ranking, clustering and so on. This is the same thing that happens when you perform a Google search. The web pages are called documents and the text with which you search is called a query. Google maintains a fixed representation for all of the documents. 

  • When you search with a query, Google will find the relevance of the query to all of the documents, rank them in order of relevance and show you the top k documents. All of this is done using the vectorized form of the query and the documents. Although Google's algorithms are highly sophisticated and optimized, this is their underlying structure.

Now coming back to our TF-IDF,

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

Terminology

  • t — term (word)

  • d — document (set of words)

  • N — count of documents in the corpus

  • corpus — the total document set

Term Frequency

  • This measures the frequency of a word in a document. It depends highly on the length of the document and the generality of the word; for example, a very common word such as “was” can appear multiple times in a document. But take two documents, one with 100 words and the other with 10,000 words. 

  • There is a high probability that a common word such as “was” is present more often in the 10,000-word document. But we cannot say that the longer document is more important than the shorter one. For this exact reason, we perform a normalization on the frequency value: we divide the frequency by the total number of words in the document.

  • Recall that we finally need to vectorize the documents. When vectorizing, we cannot just consider the words that are present in one particular document; if we did, the vector length would differ between documents, and it would not be feasible to compute similarity. So, what we do is vectorize the documents over the vocab, which is the list of all possible words in the corpus.

  • When we are vectorizing the documents, we check each word's count. In the worst case, if the term doesn't exist in the document, that particular TF value will be 0; at the other extreme, if all the words in the document are the same, it will be 1. The final normalised TF value lies in the range [0, 1], inclusive. 

TF is individual to each document and word, hence we can formulate TF as follows.

tf(t, d) = (count of t in d) / (number of words in d)

  • If we already computed the TF value and if this produces a vectorized form of the document, why not use just TF to find the relevance between documents? why do we need IDF?

  • Let me explain: though we calculated the TF value, there are still a few problems. For example, the most common words, such as “is” and “are”, will have very high values, giving those words very high importance. 

  • But using these words to compute relevance produces bad results. Such common words are called stop-words; although we will remove them later in the preprocessing step, finding the importance of a word across all the documents and normalizing by that value represents the documents much better.

Document Frequency

  • This measures the importance of a document in the whole corpus; it is very similar to TF. The only difference is that TF is a frequency counter for a term t in document d, whereas DF is the count of documents in the set N in which term t occurs. In other words, DF is the number of documents in which the word is present. We count one occurrence if the term is present in the document at least once; we do not need to know how many times it occurs.

df(t) = occurrence of t in documents

  • To keep this also in a range, we normalize by dividing by the total number of documents. Our main goal is to know the informativeness of a term, and DF is the exact inverse of it. That is why we invert the DF.

Inverse Document Frequency

IDF is the inverse of the document frequency which measures the informativeness of term t. When we calculate IDF, it will be very low for the most occurring words such as stop words (because stop words such as “is” is present in almost all of the documents, and N/df will give a very low value to that word). This finally gives what we want, a relative weightage.

idf(t) = N/df

Now there are a few other problems with the IDF: in the case of a large corpus, say 10,000 documents, the IDF value explodes. So, to dampen the effect, we take the log of the IDF.

At query time, when a word occurs which is not in the vocab, its df will be 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator.

idf(t) = log(N/(df + 1))

Finally, by multiplying the TF and IDF values, we get the TF-IDF score. There are many different variations of TF-IDF, but for now let us concentrate on this basic version.

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
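Putting the pieces together, tf-idf(t, d) = tf(t, d) * log(N/(df + 1)) can be implemented in a few lines (a minimal sketch; the three toy documents are made up):

```python
import math

docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]
N = len(docs)

def tf(t, d):
    # count of t in d / number of words in d
    return d.count(t) / len(d)

def idf(t):
    # log(N / (df + 1)), with +1 smoothing against unseen terms
    df = sum(1 for d in docs if t in d)
    return math.log(N / (df + 1))

def tf_idf(t, d):
    return tf(t, d) * idf(t)

# "blue" is rare, so it scores higher than the ubiquitous "the"
print(tf_idf("blue", docs[0]), tf_idf("the", docs[0]))
```

Note that with this particular smoothing, a term present in every document gets a slightly negative idf (log(3/4) here for "the"); libraries such as scikit-learn use variants like log((1+N)/(1+df)) + 1 to avoid this.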

Building a Spam Detector - I,

  • We all face the problem of spam in our inboxes. Let's build a spam classifier program in Python which can tell whether a given message is spam or not! We can do this by using a simple, yet powerful theorem from probability theory called Bayes' Theorem.

  • It is mathematically expressed as

    P(A|B) = P(B|A) * P(A) / P(B)
  • In this tutorial we will begin by laying out a problem and then proceed to show a simple solution to it using a Machine Learning technique called a Naive Bayes Classifier. This tutorial requires a little bit of programming and statistics experience, but no prior Machine Learning experience is required.

Spam Detection

  • You work as a software engineer at a company which provides email services to millions of people. Lately, spam has been a major problem and has caused your customers to leave. Your current spam filter only filters out emails that have been previously marked as spam by your customers. However, spammers have become cleverer, and in order to stop your customers from leaving, you are given the assignment of predicting whether an email is spam.

Training and Testing Data

  • In order to create an algorithm for this, you need to teach your program what a spam email looks like (and what non-spam emails look like). 

  • Luckily, you have all the previous emails that have been marked as spam by your customers. You also need a way to test the accuracy of your spam filter. One idea would be to test it on the same data that you used for training. However, this can lead to a major problem in ML called overfitting which means that your model is too biased towards the training data and may not work as well on elements outside of that training set. 

  • One common way to avoid this is to split your labeled data 70/30 for training/testing. This ensures you test on different data than you trained on. It’s important to note that you need a mix of both spam and non-spam data in your data sets, not just the spam ones. 

  • You really want your training data to be as similar to real email data as possible, I’ve linked a good data set at the bottom of this post.
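The 70/30 split described above is only a few lines once the labeled data is shuffled (a minimal sketch; the tiny labeled list is made up, and on a real corpus you would typically use a helper such as scikit-learn's train_test_split):

```python
import random

# Made-up labeled data: (message, is_spam)
data = [("win money now", True), ("lunch at noon?", False),
        ("free prize inside", True), ("meeting moved to 3pm", False),
        ("claim your reward", True), ("see you tomorrow", False),
        ("cheap loans today", True), ("project status update", False),
        ("you are a winner", True), ("notes from class", False)]

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(data)    # shuffle so spam and non-spam mix into both sets

split = int(0.7 * len(data))
train, test = data[:split], data[split:]

print(len(train), len(test))   # 7 3
```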

Laplace Smoothing:


  • One thing we haven’t mentioned is what happens if a word in the email you’re classifying wasn’t in your training set. In order to handle this case we need to add a smoothing factor.
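A common form of this smoothing is add-one (Laplace) smoothing: instead of count/total, estimate P(word | class) as (count + 1) / (total + |V|), where |V| is the vocabulary size. A minimal sketch (the tiny counts and vocabulary size are made up):

```python
# Made-up word counts observed in the spam class during training
spam_counts = {"free": 4, "money": 3, "hello": 1}
total_spam_words = sum(spam_counts.values())    # 8
vocab_size = 5                                  # assume 5 distinct words overall

def p_word_given_spam(word):
    # Laplace smoothing: unseen words get a small non-zero probability
    return (spam_counts.get(word, 0) + 1) / (total_spam_words + vocab_size)

print(p_word_given_spam("free"))     # (4 + 1) / (8 + 5)
print(p_word_given_spam("meeting"))  # unseen word: 1 / 13, not 0
```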


Log-Space


  • Our current implementation relies heavily on floating-point multiplication. To avoid the potential issues with multiplying very small numbers, one usually takes the logarithm of the equation to transform all the multiplications into additions. I didn't implement this in my sample code, but it is strongly recommended in practice.
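The underflow problem and the log-space fix can be demonstrated directly (a minimal sketch; the per-word probabilities are made up):

```python
import math

# 1000 small per-word probabilities, as in a long email
probs = [0.01] * 1000

# The naive product underflows to exactly 0.0 in floating point
product = 1.0
for p in probs:
    product *= p
print(product)          # 0.0 (the true value, 1e-2000, is below float range)

# Summing logs instead keeps the score usable for comparing classes
log_score = sum(math.log(p) for p in probs)
print(log_score)        # about -4605.17
```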


TF-IDF

Overall, the bag of words model for text classification is fairly naive and could be improved upon by something else like TF-IDF.

N-Grams

Another improvement we could make is not just counting individual words. N-Grams is a technique in which one considers sets of N consecutive words and uses them to calculate the probabilities. This makes sense because in English the 1-gram ‘good’ conveys something different than the 2-gram ‘not good’.
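Word n-grams are straightforward to extract (a minimal sketch; note how the 2-grams keep 'not good' together, which unigrams cannot):

```python
def word_ngrams(tokens, n):
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the food was not good".split()
print(word_ngrams(tokens, 1))  # unigrams: ['the', 'food', 'was', 'not', 'good']
print(word_ngrams(tokens, 2))  # bigrams, including 'not good'
```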

Advanced Lexical Processing

Canonicalisation

Definition - What does Canonicalization mean?

  • Canonicalization is the process of converting data that involves more than one representation into a standard approved format. Such a conversion ensures that data conforms to canonical rules. This compares different representations to assure equivalence, to count numbers of distinct data structures, to impose a meaningful sorting order and to improve algorithm efficiency, thus eliminating repeated calculations.

  • Canonicalization is used in numerous Internet and computer applications to generate canonical data from noncanonical information. Canonical representation of data is widely used in search engine optimization (SEO), Web servers, Unicode and XML.

  • This term is also known as C14N, standardization or normalization.

  • In SEO, URL canonicalization deals with Web content with more than one possible URL. 

  • This may create discrepancies in searches because the search engine may not be aware of which URL should be displayed. Canonicalization picks the best URL from several choices, usually referring to home pages. Although certain URLs appear to be the same, web servers return different results for them. Search engines consider only one URL, in canonical form.

  • File name canonicalization also matters in computer security. Some web servers may have a security rule to execute files only under a particular directory: a file is executed only if its path contains the specified directory. Special care has to be taken to check that the file name is a unique representation; the vulnerability that exploits this is called directory traversal.

  • Most of the characters in the Unicode standard have variable-length encodings. This requires a consideration of each string character and makes the string validation more complex. 

  • If all character encodings are not considered in the software implementation, there arises a possibility of bugs. This problem can be eliminated using single encoding for every character. The best alternative, which any software can take, is to check if the string is canonicalized. Strings that are not canonicalized can be rejected.

  • A canonical XML document is an XML document in XML canonical form. It is defined by canonical XML specification. Canonicalization in XML eliminates white space within tags, sorts namespace references and eliminates redundant ones, and uses particular character encodings. It also removes XML and DOCTYPE declarations, in addition to transforming relative URLs into absolute URLs.


Phonetic Hashing

Phonetic Hashing and Soundex in Python

  • Reducing a word to its base form using stemming and lemmatization is part of the technique called canonicalisation. Stemming tries to reduce a word to its root form; lemmatization tries to reduce a word to its lemma. The root and the lemma are both base forms of the inflected word; only the method of obtaining them differs.

  • There are some cases that can’t be handled by either stemming or lemmatization. You need another preprocessing method in order to stem or lemmatize such words efficiently.

  • For example, suppose the corpus contains two misspelled versions of the word ‘disappearing’ — ‘dissappearng’ and ’dissapearing’. After you stem these words, you’ll have two different stems — ‘dissappear’ and ‘dissapear’ — so the problem of redundant tokens remains. Lemmatization, on the other hand, won’t work on these two words at all and will return them unchanged, because it only works on correct dictionary spellings.

  • To deal with different spellings that occur due to different pronunciations, we use the concept of phonetic hashing which will help you canonicalise different versions of the same word to a base word.

  • There are certain words which have different pronunciations in different languages. As a result, they end up being spelt differently. Examples of such words include names of people, city names, names of dishes, etc. Take, for example, New Delhi. Delhi is also pronounced as Dilli in Hindi. Hence, it is not surprising to find both variants in an uncleaned text corpus.

  • Phonetic hashing buckets all the similar phonemes (words with similar sound or pronunciation) into a single bucket and gives all these variations a single hash code. Hence, the words ‘Dilli’ and ‘Delhi’ will have the same code.

  • Phonetic hashing is done using the Soundex algorithm. It doesn’t matter which language the input word comes from — as long as the words sound similar, they will get the same hash code.

Now, let’s see it through an example: the Soundex of the word ‘Mississippi’. The steps to calculate the hash code are as follows:

  1. The phonetic hash is a four-letter code. The first letter of the code is the first letter of the input word, so it is retained as-is: the first character of the phonetic hash is ‘M’. Now we need to encode the rest of the letters of the word.

  2. Next, map all the consonant letters (except the first letter) to digits. All the vowels are written as-is, and ‘H’, ‘Y’ and ‘W’ are left unencoded (i.e., they are removed from the word). After mapping the consonants, the code becomes MI22I22I11I.

  3. The third step is to remove all the vowels. ‘I’ is the only vowel. After removing all the ‘I’s, we get the code M222211. Now, you would need to merge all the consecutive duplicate numbers into a single unique number. All the ‘2’s are merged into a single ‘2’. Similarly, all the ‘1’s are merged into a single ‘1’. The code that we get is M21.

  4. The fourth step is to force the code to make it a four-letter code. You either need to pad it with zeroes in case it is less than four characters in length. Or you need to truncate it from the right side in case it is more than four characters in length. Since the code is less than four characters in length, you’ll pad it with one ‘0’ at the end. The final code is M210.

  5. This way, you can find all the words in the corpus that have the same hash code and replace them with the correct word.
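The steps above can be sketched in Python. This follows the variant described in these notes (vowels removed before merging duplicate digits), which differs slightly from classic Soundex:

```python
def soundex(word):
    """Four-character phonetic hash, following the steps in these notes
    (a simplified variant of classic Soundex)."""
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2',
             'q': '2', 's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3', 'l': '4',
             'm': '5', 'n': '5', 'r': '6'}
    word = word.lower()
    first = word[0].upper()                    # step 1: keep the first letter
    # step 2: drop h/w/y, map the remaining consonants to digits
    tail = [codes.get(ch, ch) for ch in word[1:] if ch not in 'hwy']
    # step 3: remove the vowels, then merge consecutive duplicate digits
    digits = [ch for ch in tail if ch.isdigit()]
    merged = []
    for d in digits:
        if not merged or d != merged[-1]:
            merged.append(d)
    # step 4: pad with zeroes or truncate to four characters
    return (first + ''.join(merged) + '000')[:4]

print(soundex('Mississippi'))                # M210
print(soundex('Delhi'), soundex('Dilli'))    # D400 D400
```

Note how ‘Delhi’ and ‘Dilli’ collapse to the same hash code, exactly as described above.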

Edit Distance


  • Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two strings referred to as the source string and the target string.


  • The distance between the source string and the target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target. The lower the distance, the more similar the two strings. 

  • Among the common applications of the Edit Distance algorithm are: spell checking, plagiarism detection, and translation memory systems.
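Before turning to NLTK, the definition above can be implemented directly with classic dynamic programming; a minimal sketch:

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform source into target (Levenshtein distance)."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i            # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j            # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(edit_distance('mapping', 'mappings'))  # 1
```

NLTK's `edit_distance` computes the same quantity, so in practice you can use the library call shown next.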

Edit Distance Python NLTK

NLTK library has the Edit Distance algorithm ready to use. Let’s take some examples.

Example #1

import nltk


w1 = 'mapping'

w2 = 'mappings'


nltk.edit_distance(w1, w2)


The output is 1 because the difference between “mapping” and “mappings” is only one character, “s”.


Example #2


import nltk


sent1 = "It might help to re-install Python if possible."

sent2 = "It can help to install Python again if possible."

sent3 = "It can be so helpful to reinstall C++ if possible."

sent4 = "help It possible Python to re-install if might." # This has the same words as sent1 with a different order.

sent5 = "I love Python programming."


ed_sent_1_2 = nltk.edit_distance(sent1, sent2)

ed_sent_1_3 = nltk.edit_distance(sent1, sent3)

ed_sent_1_4 = nltk.edit_distance(sent1, sent4)

ed_sent_1_5 = nltk.edit_distance(sent1, sent5)



print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')

print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')

print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')

print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')


Output : 

14 Edit Distance between sent1 and sent2

19 Edit Distance between sent1 and sent3

32 Edit Distance between sent1 and sent4

33 Edit Distance between sent1 and sent5


So it is clear that sent1 and sent2 are more similar to each other than other sentence pairs.

Jaccard Distance

  • Jaccard Distance is a measure of how dissimilar two sets are.  The lower the distance, the more similar the two strings.

  • Jaccard Distance depends on another concept called the “Jaccard Similarity Index”, which is the number of elements in both sets divided by the number of elements in either set:

  • J(X,Y) = |X∩Y| / |X∪Y|

  • Then we can calculate the Jaccard Distance as follows:

  •  D(X,Y) = 1 – J(X,Y)

  • For example, if we have two strings, “mapping” and “mappings”, the intersection of the two sets has 6 elements (there are 7 matching characters, but a set keeps only unique characters, so the repeated “p” counts once), and the union has 7 elements. So the Jaccard Similarity Index is 6/7 ≈ 0.857 and the Jaccard Distance is 1 − 0.857 ≈ 0.143.
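The numbers above can be checked with a few lines of plain Python (no NLTK needed):

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard Similarity Index |X ∩ Y| / |X ∪ Y|
    of the two words' character sets."""
    x, y = set(a), set(b)
    return 1 - len(x & y) / len(x | y)

x, y = set('mapping'), set('mappings')
print(len(x & y), len(x | y))                             # 6 7
print(round(jaccard_distance('mapping', 'mappings'), 3))  # 0.143
```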


Jaccard Distance Python NLTK

The good news is that the NLTK library has the Jaccard Distance algorithm ready to use. Let’s take some examples.

Example #1

import nltk


w1 = set('mapping')

w2 = set('mappings')


nltk.jaccard_distance(w1, w2)


Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly; you must first convert them to the set type.


Spell Corrector

Spell checking is a basic requirement in any text processing or analysis. The Python package pyspellchecker finds the words that may have been misspelled and suggests possible corrections.

First, install the required package with the following command in your Python environment.

pip install pyspellchecker


from spellchecker import SpellChecker


spell = SpellChecker()


# find those words that may be misspelled

misspelled = spell.unknown(['let', 'us', 'wlak','on','the','groun'])


for word in misspelled:

    # Get the one `most likely` answer

    print(spell.correction(word))


    # Get a list of `likely` options

    print(spell.candidates(word))


Output : 

When we run the above program we get the following output −

group

{'group', 'ground', 'groan', 'grout', 'grown', 'groin'}

walk

{'flak', 'weak', 'walk'}

  • Correct spelling is very important for any kind of documentation. Many automatic spell checkers are available online for different languages. They help us correct a wrongly written word, or replace it with the correct one automatically.

  • They also help to find grammatical mistakes and syntax errors. Grammarly’s automatic spell checker is a well-known example.

  • No such automatic spell checker exists yet for the Bengali language, but software like Grammarly is badly needed by Bengali-language users.

  • Here we discuss an approach to automatically replacing Bengali words with their correct spellings, i.e., to building an automatic spell checker. The whole procedure depends on word2vec. A pre-trained word2vec file with a small vocabulary is used; it is nevertheless effective for this work. A short description of building the spell checker is given below.

Library Function

  • The gensim library is used to load the pre-trained Bengali word2vec file from the PC.

import gensim

Word2Vec

  • Word embedding is one of the most significant techniques in natural language processing: words are mapped to vectors of real numbers. 

  • Word embeddings can capture the meaning of a word in a document, as well as semantic and syntactic similarity and relations with other words. They have also been widely used in recommender systems and text classification.

  • ‘bnword2vec’ is a pre-trained word2vec file for the Bengali language and ‘.txt’ is the extension of the loaded file.

  • model = gensim.models.KeyedVectors.load_word2vec_format('bnword2vec.txt')

Words Rank

  • Word rank orders the vocabulary by importance: more frequent words appear earlier in the word2vec file, so a word’s position in the file can serve as its rank. Here we read each word’s rank out of the Word2Vec file.

words = holds the words, indexed by their position in the Word2vec file.

w_rank = a dictionary that stores each word’s rank as the loop runs.

enumerate() = a built-in method that adds a counter to an iterable and returns it in the form of an enumerate object.

WORDS = this variable carries the value of the w_rank dictionary.
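Since the Bengali word2vec file is not available here, the word-rank pattern described above can be sketched with an ordinary frequency-ordered word list standing in for the model’s vocabulary (the list contents are made up for illustration):

```python
# 'words' stands in for the word2vec vocabulary; in the real pipeline it
# would come from the loaded bnword2vec model, most frequent word first.
words = ['the', 'of', 'and', 'python', 'bengali']

# w_rank maps each word to its position; enumerate() supplies the counter.
w_rank = {}
for rank, word in enumerate(words):
    w_rank[word] = rank

WORDS = w_rank   # the dictionary the rest of the spell checker reads

def P(word):
    """Score a candidate: a lower rank (more frequent word) scores higher."""
    return -WORDS.get(word, float('inf'))

# Pick the more plausible of two candidate corrections.
print(max(['bengali', 'python'], key=P))   # python (rank 3 beats rank 4)
```

The spell checker can then rank candidate corrections by this score, preferring corrections that are more frequent in the vocabulary.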

Pointwise mutual information




Understanding Pointwise Mutual Information in NLP


  • Natural Language Processing (NLP) is a field of Artificial Intelligence whose purpose is finding computational methods to interpret human language as it is spoken or written. 

  • The idea of NLP goes beyond a mere classification task, which could be carried out by ML algorithms or deep learning neural networks.

  •  Indeed, NLP is about interpretation: you want to train your model not only to detect frequent words, count them, or eliminate noisy punctuation; you want it to tell you whether the mood of a conversation is positive or negative, whether the content of an e-mail is mere publicity or something important, whether recent reviews of thriller books have been good or bad.

  • However, treating words, documents and topics as numerical inputs is more cumbersome than working directly with numbers or images.

  •  Indeed, NLP also uses deep neural network techniques to solve a given task and, by definition, every algorithm can only accept numeric variables as inputs. So, how can we ‘quantify’ a word?

  • The idea behind the NLP algorithm is that of transposing words into a vector space, where each word is a D-dimensional vector of features. By doing so, we can compute some quantitative metrics of words and between words, namely their cosine similarity. 
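As a toy illustration of the cosine-similarity metric mentioned above (the three-dimensional “embeddings” below are made up for illustration; real word vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.05]
apple = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1: similar words
print(cosine_similarity(king, apple))   # much smaller: unrelated words
```

Words with similar meanings end up with nearby vectors, so their cosine similarity is close to 1, while unrelated words score much lower.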

  • Needless to say, this is far from being an exhaustive explanation, hence if you are interested I suggest this reading.

  • In this article, I’m going to dwell on a very specific aspect of NLP, which is how to understand whether two (or more) words actually form a unique concept. 

  • If this is the case, we can reduce the dimensionality of our task, since that couple of words (called a bigram, or an n-gram in the general case of n words) can be considered as a single word. Hence, we can remove one vector from our computations.

  • Namely, consider the expression ‘social media’: both the words can have independent meaning, however, when they are together, they express a precise, unique concept.

  • Nevertheless, it is not an easy task: if both words are frequent by themselves, their co-occurrence might be just chance. Consider the name ‘Las Vegas’: it is not that frequent to read only ‘Las’ or only ‘Vegas’ (in English corpora, of course). 

  • The only way we see them is in the bigram Las Vegas, hence it is likely for them to form a unique concept. On the other hand, if we think of ‘New York’, it is easy to see that the word ‘New’ will probably occur very frequently in different contexts. How can we assess that the co-occurrence with York is meaningful and not as vague as ‘new dog, new cat…’?

  • The answer lies in the Pointwise Mutual Information (PMI) criterion. The idea of PMI is that we want to quantify the likelihood of co-occurrence of two words, taking into account the fact that it might be caused by the frequency of the single words. Hence, the algorithm computes the (log) probability of co-occurrence scaled by the product of the single probabilities of occurrence, as follows:

  • PMI(a, b) = log( P(a, b) / ( P(a) · P(b) ) )

  • Now, knowing that when ‘a’ and ‘b’ are independent their joint probability is equal to the product of their marginal probabilities, a ratio equal to 1 (hence a log equal to 0) means that the two words together don’t form a unique concept: they co-occur by chance.

  • On the other hand, if either one of the words (or even both of them) has a low probability of occurrence if singularly considered, but its joint probability together with the other word is high, it means that the two are likely to express a unique concept.
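The PMI idea can be sketched on a toy corpus; the sentences below are made up so that ‘las vegas’ only ever appears as a bigram while ‘new’ also occurs in other contexts, mirroring the discussion above:

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), estimated from
    unigram and bigram counts in a small tokenized corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_a = unigrams[w1] / len(tokens)
    p_b = unigrams[w2] / len(tokens)
    p_ab = bigrams[(w1, w2)] / (len(tokens) - 1)
    return math.log(p_ab / (p_a * p_b))

tokens = ("we went to las vegas and then to new york "
          "a new dog and a new cat live in new york").split()

# 'las' and 'vegas' always co-occur, so their PMI is higher than that of
# 'new york', where 'new' also appears in 'new dog' and 'new cat'.
print(pmi(tokens, 'las', 'vegas'))
print(pmi(tokens, 'new', 'york'))
```

Both PMI scores are positive (the pairs co-occur more than chance predicts), but ‘las vegas’ scores higher because its component words almost never occur apart.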

||========UNIT ONE END HERE=========||












