Tokenization in NLP
Basics and illustrations with the latest technologies
What is Tokenization?
Tokenization is one of the most common tasks in Natural Language Processing (NLP) and a powerful way of dealing with text data. It separates a piece of text into smaller units called “tokens”.
“Tokens are the building blocks of Natural Language.”
Here, tokens can be characters, words, or sub words. Hence, tokenization can be broadly classified into 3 types: character, word, and sub word tokenization. Most of the existing approaches implement tokenization in Python. The main goal of tokenization is to create a vocabulary.
Tokenization is performed on the given corpus to obtain tokens. These tokens are then used to prepare a vocabulary, which is the set of unique tokens in the corpus. The vocabulary can be built either from every unique token in the corpus or from only the top K most frequently occurring words.
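To make the two ways of building a vocabulary concrete, here is a minimal sketch that uses only the Python standard library. The tiny corpus, the whitespace split, and the value of K are assumptions chosen for illustration.

```python
from collections import Counter

# A toy corpus (hypothetical example data).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Word-level tokens for the whole corpus; a whitespace split keeps the sketch simple.
tokens = [token for sentence in corpus for token in sentence.split()]

# Option 1: every unique token becomes a vocabulary entry.
full_vocab = sorted(set(tokens))

# Option 2: keep only the K most frequent tokens.
K = 3
counts = Counter(tokens)
top_k_vocab = [token for token, _ in counts.most_common(K)]

print(full_vocab)
print(top_k_vocab)
```

The full vocabulary has 7 entries here, while the top K version keeps only the 3 most frequent words, which is the usual trade-off between coverage and vocabulary size.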
Let’s look at the 3 main types of tokenization:
01. Character Tokenization
- Character tokenization splits a piece of text into a sequence of characters.
02. Word Tokenization
- Word tokenization splits a piece of text into individual words based on a certain delimiter.
03. Sub word Tokenization
- Sub word tokenization splits a piece of text into sub words (or character n-grams). For example, a word like lower can be segmented as low-er, smartest as smart-est, and so on.
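The three types can be compared side by side with a small sketch. The sub word part is only a toy: the hand-picked suffix list and the split_subwords helper are hypothetical, written to reproduce the low-er and smart-est examples. Real sub word tokenizers learn their splits from data rather than from a fixed suffix list.

```python
text = "smartest lower"

# Character tokenization: every character is a token (spaces are dropped here).
char_tokens = [ch for ch in text if not ch.isspace()]

# Word tokenization: split on a delimiter, in this case whitespace.
word_tokens = text.split()

# Sub word tokenization (toy version): peel a known suffix off each word.
SUFFIXES = ("est", "er")  # hypothetical, hand-picked suffix list


def split_subwords(word):
    """Return [stem, suffix] if the word ends in a known suffix, else [word]."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


subword_tokens = [piece for word in word_tokens for piece in split_subwords(word)]

print(char_tokens)
print(word_tokens)
print(subword_tokens)
```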
What are the techniques?
Tokenization is a critical step in the overall NLP pipeline. There are 6 main techniques we can use to perform it. Most of them can do sentence tokenization as well as word tokenization.
1. Python’s split() function
2. Using Regular Expressions (RegEx)
3. Using NLTK
4. Using the spaCy library
5. Using Keras
6. Using Gensim
1. Python’s split() function
Let’s start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at runs of whitespace (spaces, tabs, and newlines). We can change the separator to any other string.
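A short sketch of both uses of split(): the default whitespace split for words, and a custom separator for a rough sentence split. The sample text and the ". " separator are assumptions for illustration.

```python
text = "Tokenization is the first step. It breaks text into pieces."

# Word tokenization: with no argument, split() breaks on whitespace.
words = text.split()

# Rough sentence tokenization: use ". " as the separator instead.
sentences = text.split(". ")

print(words)
print(sentences)
```

Note that punctuation stays attached to the words ("step." and "pieces."), and the separator itself is removed from the sentences. This is the main limitation of split() compared with the other techniques.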
2. Using Regular Expressions (RegEx)
A regular expression is a special sequence of characters that defines a pattern for matching or finding other strings or sets of strings. We can use the re module in Python to work with regular expressions. It is part of the standard library, so it needs no separate installation.
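Here is a minimal sketch of word and sentence tokenization with the re module. The two patterns are assumptions picked for this example, not the only reasonable choices: the first keeps letters, digits, underscores, and apostrophes together as a word, and the second splits after sentence-ending punctuation.

```python
import re

text = "Tokenization isn't hard. Is it? Try it!"

# Word tokenization: find runs of word characters and apostrophes.
words = re.findall(r"[\w']+", text)

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)
print(sentences)
```

Unlike split(), the word pattern drops the punctuation, and the lookbehind in the sentence pattern keeps the closing punctuation mark with each sentence.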
3. Using NLTK
This is a library you will appreciate more the more you work with text data. NLTK, short for Natural Language Toolkit, is a library written in Python for symbolic and statistical Natural Language Processing. It is not part of the standard library, so you need to install it first, for example with pip (pip install nltk).
NLTK contains a module called tokenize, which offers two main kinds of tokenization:
- Word tokenize: We use the word_tokenize() method to split a sentence into tokens or words
- Sentence tokenize: We use the sent_tokenize() method to split a document or paragraph into sentences
4. Using the spaCy library
spaCy is one of the most accurate and useful libraries to reach for when working on an NLP project. It is an open-source library for advanced Natural Language Processing (NLP). It supports 49+ languages and is known for its processing speed.
Let’s see how we can use spaCy to perform tokenization.
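A minimal sketch, assuming spaCy v3 is installed. To avoid downloading a trained model, it uses a blank English pipeline, which still includes the rule-based tokenizer, and adds the rule-based sentencizer component for sentence boundaries. A trained pipeline could be used instead, but that choice is outside this sketch.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# The rule-based "sentencizer" component adds sentence boundaries.
nlp.add_pipe("sentencizer")

text = "spaCy splits text into tokens. It also handles punctuation!"
doc = nlp(text)

# Word tokenization: iterate over the Doc to get Token objects.
words = [token.text for token in doc]

# Sentence tokenization: doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]

print(words)
print(sentences)
```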
5. Using Keras
Keras! It is one of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is super easy to use and can run on top of TensorFlow. In the NLP context, we can use Keras to clean and tokenize the unstructured text data that we typically collect.
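As a sketch, word tokenization can be done with the text_to_word_sequence helper. This assumes a TensorFlow installation that still exposes the legacy tensorflow.keras.preprocessing.text module. That module is marked as deprecated in recent releases, so the import path may differ or be missing in your environment.

```python
# Assumption: the legacy preprocessing module is available in the installed version.
from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "Keras is super easy to use, and it runs on TensorFlow!"

# With its default settings the helper lowercases the text, strips punctuation,
# and splits on spaces.
words = text_to_word_sequence(text)

print(words)
```

This helper does the cleaning and the splitting in one step, which is why the result is lowercased and free of punctuation.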
6. Using Gensim
The final tokenization method we will cover here uses the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document. Gensim can also be installed with pip (pip install gensim). We can import the tokenize function from the gensim.utils module to perform word tokenization.
In this article, we saw a basic illustration of tokenization (of words as well as sentences) for a given text. There are other ways as well, but these are good enough to get you started on the topic!