Natural language processing (or NLP) is a branch of artificial intelligence that gives machines the ability to read, understand and derive meaning from human languages. It can allow computers to communicate with people using a human language in both written and spoken form.
NLP draws from several disciplines – including Linguistics and Computer Science – to decipher language structure and to build models. The resulting systems can comprehend, break down, and extract significant details from text and speech.
Every day humans interact with each other through social media, transferring vast quantities of freely available data to one another. This data can be extremely useful in understanding human behaviour and customer habits. Data Analysts and machine learning experts utilize this data to give machines the ability to mimic human linguistic behaviour.
There are three different types of NLP:
- Natural Language Understanding (NLU): Enables the restructuring of unstructured data so that machines can understand and analyse it, discovering key facts about the entities and features it describes.
- Natural Language Generation (NLG): Analyses documents and generates summaries and explanations as input data for an AI/ML model.
- Language Processing & OCR: Can be combined with Optical Character Recognition (OCR) technology to convert data contained in images or videos into plain text, which can then be analysed.
NLP Pipeline
Figure 1. NLP Pipeline
A typical NLP pipeline consists of several steps as outlined below:
- Segmentation breaks text apart into separate sentences.
- Tokenization breaks each sentence into separate words or tokens.
- Part-of-Speech Tagging assigns a part of speech to each token. This is done by feeding each token to a pre-trained model.
- Lemmatisation consists of identifying the most basic form (lemma) of each token: for example, for cats the lemma is cat, for found it is find.
- Identifying stop words, that is, frequently occurring grammatical words that usually lack conceptual meaning, such as and, to, for, so, etc. These words may be filtered out before performing a statistical analysis, as they would not be very useful.
- Dependency Parsing is the process of identifying the hierarchical relationships between words, describing how each word depends on the others.
- Noun Phrases Identification is usually done to group together words which refer to the same entity to simplify the analysis.
- Named Entity Recognition consists of detecting and labelling nouns with the real-world entities they represent.
- Coreference Resolution consists of mapping pronouns, such as he, she, him, it, etc., to the entities they refer to, so that all mentions of the same entity can be grouped together. This is the most complex step.
Figure 2 shows the result of applying coreference resolution to a text.
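The early steps of the pipeline described above can be sketched in plain Python. This is a minimal illustration using only the standard library: the stop-word list is a tiny illustrative sample, and the lemmatizer is a deliberately naive suffix-stripping rule standing in for the dictionary- and model-based lemmatizers real systems use.

```python
import re

# A tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"and", "to", "for", "so", "the", "a", "an", "is", "it", "was"}

def segment(text):
    """Segmentation: split text into sentences on ., ! or ? boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Tokenization: split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stop_words(tokens):
    """Stop-word filtering: drop frequent grammatical words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_lemma(token):
    """Lemmatisation (very crude): strip a plural 's' as a stand-in for a
    dictionary-based lemmatizer (which would also map found -> find)."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

text = "The cats sat on the mat. They purred softly!"
for sentence in segment(text):
    tokens = remove_stop_words(tokenize(sentence))
    print([naive_lemma(t) for t in tokens])
```

Later steps such as part-of-speech tagging, dependency parsing, and named entity recognition cannot be done well with hand-written rules like these; in practice they are handled by pre-trained models in NLP libraries.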
Applications
NLP has a huge variety of business applications. Here is an overview of the most important ones:
- Social Media Sentiment Analysis: helps to analyse how a brand or a product is doing based on the positive, negative, or neutral emotions found within its social mentions. In this way, it provides actionable insights.
- Patient Voice & Healthcare: NLP technologies are used to capture and manage patient notes and feedback, so that the quality of the offered services can be improved.
- Language Translation: NLP technologies provide text to text and speech to text translation at scale, quickly and efficiently. This allows people to watch foreign films with subtitles or read articles written in a foreign language. Despite these remarkable results, machine translation still has a long way to go because some of the languages of the world present a degree of complexity that cannot yet be broken down by machine learning models.
- Text Analytics: NLP technologies can collect text from a variety of sources (news, social media posts, tweets) on a given topic, such as a product or a company, and convert this raw data into meaningful information from which insights can be derived.
- Finance and Automated Trading: NLP can review financial news and make recommendations on which stock would be a good investment. It can also make transactions and investments based on human instructions.
- Virtual Assistant: NLP enables voice recognition algorithms to recognize words and speech patterns and infer meaning from them. This is what allows us to talk to smartphones or virtual assistants and get responses in a conversational style.
History
NLP originated in the 1940s when scientists started working on algorithms that would allow machines to perform translations from one language to another. One of the first researchers to work on machine translation was Warren Weaver, an American mathematician.
However, they soon realised that the task was more complex than they expected, and that they lacked the right technological resources and linguistic theoretical framework to perform such a task. Several changes needed to take place to build a machine to perform translations or to communicate in a more human fashion.
Those changes occurred over the past 60 to 65 years. Firstly, American linguist Noam Chomsky developed an abstract, mathematical theory of language known as transformational grammar in his seminal work Syntactic Structures, published in 1957. Transformational grammar is important in NLP because it introduced a formalism which converts natural language sentences into a format which can be used by machines.
Secondly, in the 1980s two major changes occurred: the first was the increase in computational power, which made increasingly complex operations feasible; the second was the shift to machine learning algorithms, which rely heavily on statistical models.
Finally, in the 2010s, the deep learning revolution occurred and deep neural network-style machine learning methods became widespread in NLP. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules; the major advantage of machine learning is that it uses statistical inference to learn such rules automatically through the analysis of large corpora of typical real-world examples (usually corpora of text or speech).
All these theoretical and technological advancements facilitated the creation of more sophisticated translation software. Text comprehension and speech processing technologies also improved, ultimately leading to today's virtual assistants, which are able to understand human language and provide responses.