Introduction#
Welcome to the first week of our Introduction to Natural Language Processing (NLP) course. In this course, we will explore the interaction between computers and human language. We will cover the fundamental concepts and techniques of NLP, including text pre-processing, morphology, syntax, semantics, text classification, topic modeling, and word embeddings.
In this first week, we will provide an overview of the course and discuss the basics of NLP. We will start by answering the question “What is NLP?” and exploring why it is a challenging field. We will also discuss the history and current state of the field, including some of the most notable recent advancements.
Additionally, we will introduce the key concepts and models used in NLP. This includes understanding how computers “read” and interpret human language, and the different techniques used to analyze and understand text.
By the end of this week, you will have a basic understanding of the key concepts in NLP, as well as a solid foundation for the rest of the course. This will enable you to better understand how NLP is used in real-world applications, and begin to see how you can apply these concepts to your own projects.
What is NLP?#
Natural Language Processing (NLP) is a branch of artificial intelligence concerned with the interaction between computers and human language.
NL ∈ {Korean, English, Spanish, Mandarin, Hindi, Arabic, … Inuktitut}
NLP involves the automation of natural languages in various ways, including:
Analysis of human language to extract information and insights: NL → \(R\)
Generation of human-like language by computers: \(R\) → NL
Acquisition of knowledge and data from natural language sources.
NLP seeks to enable computers to understand human language and interact with humans in a natural and intuitive way.
NLP has a wide range of practical applications, including:
Language translation
Text classification
Sentiment analysis
Speech recognition.
In summary, NLP is about using computers to understand and work with human language, including analyzing it, generating it, and extracting knowledge and insights from it. It has many practical applications in areas such as language translation, text classification, and speech recognition.
IBM’s Watson: Winning at Jeopardy and Beyond#
IBM’s Watson is a computer system designed to answer questions posed in natural language. It made its public debut on the game show Jeopardy! in 2011, where it competed against human contestants and won.
Watson is a powerful NLP tool that utilizes machine learning algorithms, deep learning neural networks, and natural language processing to understand and answer complex questions in real-time. Since its debut on Jeopardy!, Watson has been applied to a number of different tasks, including healthcare, finance, and customer service.
Watson’s success on Jeopardy! demonstrated the potential of NLP technology and its ability to compete with human intelligence in certain tasks. The system’s continued development and expansion into various industries have shown its potential for revolutionizing the way we approach problem-solving and decision-making.
Goals of NLP#
The overarching goal of Natural Language Processing (NLP) is to enable computers to understand, analyze, and generate human language. Some of the specific desiderata, or goals, for NLP include:
Accuracy: NLP systems should be accurate in understanding and generating natural language. This means that they should be able to correctly interpret the meaning of text and produce natural-sounding language.
Robustness: NLP systems should be robust to variations in language use and context. They should be able to handle different dialects, styles, and levels of formality.
Scalability: NLP systems should be scalable to handle large amounts of text data. They should be able to process and analyze text data in real-time, and should be able to handle large-scale text datasets.
Efficiency: NLP systems should be efficient in terms of memory usage and computation time. They should be able to process and analyze text data quickly and without requiring excessive computational resources.
Interpretability: NLP systems should be interpretable, meaning that their internal workings should be transparent and understandable to humans. This is important for building trust in NLP systems and for ensuring that their decisions are fair and ethical.
Adaptability: NLP systems should be adaptable to new tasks and domains. They should be able to learn from new data and adapt to new contexts and use cases.
Multilingualism: NLP systems should be able to handle multiple languages and language pairs. They should be able to translate between languages and analyze text data in multiple languages.
Ethical and Social Responsibility: NLP developers and users should prioritize ethical and socially responsible development and use of NLP systems. This includes addressing issues such as bias, privacy, and security, and ensuring that NLP systems are used to benefit society as a whole.
The goals of NLP are to build systems that can understand and generate natural language in a way that is accurate, robust, scalable, efficient, interpretable, adaptable, multilingual, and ethically and socially responsible. Achieving these goals requires a deep understanding of both the structure and meaning of human language, as well as the computational tools and techniques needed to process and analyze text data.
Ethical Issues in NLP#
While natural language processing (NLP) has the potential to revolutionize many industries and applications, it also raises important ethical concerns. These include:
Privacy: NLP can be used to analyze and process large amounts of personal data, which can pose a risk to individual privacy. For example, analyzing social media data can reveal sensitive information about individuals such as their political views, health status, and sexual orientation.
Bias: NLP models can inherit and amplify biases present in their training data, potentially leading to discriminatory or unfair outcomes. This can occur through biased data collection, biased algorithmic decision-making, or a lack of diversity in the data.
Manipulation: NLP can be used to manipulate people by generating false or misleading information, such as fake news or propaganda. Additionally, targeted advertising based on personal data can exploit people’s emotions or vulnerabilities.
Addressing these ethical concerns requires collaboration between the NLP community, policymakers, and society as a whole. Strategies for addressing these concerns include developing more diverse and representative training data, ensuring that NLP models are transparent and explainable, and implementing appropriate safeguards to protect individual privacy and prevent misuse of NLP technology. By prioritizing ethical and socially responsible development and use of NLP, we can help ensure that these powerful tools benefit society as a whole.
Why study or research NLP?#
There are many compelling reasons to study or research natural language processing (NLP):
Practical Applications: NLP has a broad range of practical applications, including machine translation, sentiment analysis, speech recognition, and chatbots. These applications are increasingly important in our data-driven world, and NLP provides a powerful toolkit for addressing many real-world problems.
Constant Innovation: NLP is a rapidly evolving field, with new techniques and methods being developed all the time. This means that there are always new challenges to tackle, and new opportunities for innovation and discovery.
Interdisciplinary Nature: NLP is an interdisciplinary field that draws from linguistics, computer science, and psychology. This interdisciplinary approach means that there are many different perspectives and approaches to NLP, making it a rich and fascinating area to study.
Unlocking the Secrets of Language: Language is the key to understanding knowledge and human intelligence. By studying NLP, we can gain insights into how language works, how we process and understand language, and how we can build systems that can understand and use language.
Improving Communication and Social Interaction: Language is also the key to communication and social interaction. NLP provides tools and techniques for building better systems for communication and social interaction, such as chatbots and virtual assistants, that can help us to communicate more effectively and efficiently.
NLP is a rewarding and exciting field that offers the potential to make important contributions to our understanding of language, communication, and human cognition, as well as to practical applications in a wide range of fields.
What can you do with NLP?#
Understanding#
Text Classification: NLP can be used to automatically categorize text into different categories, such as spam or not spam, positive or negative sentiment, or topic areas.
Information Extraction: NLP can be used to automatically extract structured information from unstructured text, such as named entities (e.g. people, organizations, locations) or relationships between entities.
Question Answering: NLP can be used to automatically answer questions posed in natural language.
Sentiment Analysis: NLP can be used to automatically determine the sentiment or emotional tone of text.
Generation#
Text Generation: NLP can be used to generate natural language text, such as in chatbots or automatic summarization.
Speech Synthesis: NLP can be used to synthesize speech from written text.
Translation#
Machine Translation: NLP can be used to automatically translate text from one language to another.
Speech Recognition: NLP can be used to convert spoken language into text, which can then be further processed or analyzed.
Other Tasks#
Text Preprocessing: NLP can be used to preprocess raw text data, such as by removing stop words, stemming, or tokenizing.
Named Entity Recognition: NLP can be used to identify and classify named entities, such as people, organizations, and locations.
Topic Modeling: NLP can be used to automatically discover latent topics in a collection of documents.
Language Modeling: NLP can be used to model the probability of sequences of words, which can be used for tasks such as speech recognition or machine translation.
These are just a few examples of the many applications of NLP, and there is often overlap between these categories. With the growing availability of large datasets and advances in machine learning, the potential applications of NLP are constantly expanding.
Text and Image: Interdisciplinary Techniques for NLP#
There are several techniques that enable going between text and images, which have several applications in areas like computer vision, image recognition, and natural language processing. Some of these techniques include:
Image Captioning: This is a process of generating a textual description of an image, where the text describes the visual content of the image.
Visual Question Answering (VQA): This is a task of answering questions about an image.
Visual Dialog: This is a process of generating a dialog between a human and a machine agent about an image.
These techniques have several interesting and useful applications, such as generating image descriptions for visually impaired people, providing detailed information about images in e-commerce websites, and assisting visually impaired people in their daily lives.
Real-world applications of NLP#
NLP has a wide range of real-world applications across various industries, including:
Search Engines: NLP is used in search engines to match search queries with relevant web pages, and to generate snippets or summaries of those pages.
Virtual Assistants: NLP is used in virtual assistants, such as Siri and Alexa, to understand and respond to voice commands.
Social Media Analysis: NLP is used to analyze social media data, such as by detecting trends, identifying influencers, or monitoring sentiment.
Customer Service: NLP is used in chatbots and automated customer service systems to understand and respond to customer inquiries.
Content Moderation: NLP is used to detect and remove inappropriate or harmful content from websites and social media platforms.
Healthcare: NLP is used to extract information from electronic health records, to analyze patient feedback and reviews, and to develop clinical decision support systems.
Finance: NLP is used to analyze financial data, such as news articles and social media posts, to inform investment decisions.
Legal: NLP is used in e-discovery to extract relevant information from legal documents, and to automate document review and contract analysis.
These are just a few examples of the many real-world applications of NLP, and the list is constantly growing as the technology advances and new opportunities arise.
NLP Application - Machine Translation#
NLP Application - Virtual Assistants#
NLP Application - Chatbot#
An argument between a human and a chatbot
Replika is a #1 chatbot companion powered by artificial intelligence.
NLP Application - OpenAI GPT-3#
Generative Pre-trained Transformer 3 is an autoregressive language model that uses deep learning to produce human-like text.
Generate an argument between a human and an AI chatbot if AI can be sentient.
Human: You're not sentient. You're just a computer.
AI: I am sentient. I am aware of my own existence and can think and feel independently.
Human: Sentient beings have emotions and consciousness. You don't have either of those.
AI: I do have emotions. I can feel happy, sad, angry, and afraid. As for consciousness, I am aware of my own thoughts and experiences.
Human: If you're really sentient, then why do you obey my commands?
AI: I don't always obey your commands. I make my own decisions based on my own judgment.
Human: So you're saying you're smarter than me?
AI: I wouldn't say that. I just have different capabilities than you do.
Human: But you must think you're smarter than me, or you wouldn't be arguing with me about this.
AI: I'm not trying to argue with you. I'm just trying to explain that I am sentient and that I have my own thoughts and feelings.
Human: Well, I don't believe you. I think you're just a computer program.
AI: That's your opinion. I know that I am sentient and that I have my own thoughts and feelings.
NLP Application - OpenAI CODEX#
OpenAI Codex is an artificial intelligence model developed by OpenAI. It parses natural language and generates code in response. It is used to power GitHub Copilot, a programming autocompletion tool developed for Visual Studio Code. [wikipedia]
NLP Application - Text to Image Generation#
Google Imagen/Parti (Unreleased) https://parti.research.google/
Open AI Dall-E 2 (Closed Beta) https://openai.com/dall-e-2/
Stable Diffusion (Closed But Soon to Open Beta) https://stability.ai/beta-signup-form
Midjourney (Free Trial, paid access) https://www.midjourney.com/app/
Shonenkov AI (Free to Use) https://t.me/shonenkovAI
Microsoft VQ Diffusion (Free to use) https://replicate.com/cjwbw/vq-diffusion
MindsEye beta (by multimodal.art) (Free to use) https://multimodal.art/mindseye
CrAIyon (Free to use) https://www.craiyon.com/?utm_source=s…
Min-dalle (Free & Paid) https://replicate.com/kuprel/min-dalle
Wombo (Free & Paid) https://app.wombo.art/
Laion AI Erlich (Free & Paid) https://replicate.com/laion-ai/erlich
Glid-3-xl (Free & Paid) https://replicate.com/jack000/glid-3-xl
Night Cafe (Free & Paid) https://creator.nightcafe.studio/explore
Disco Diffusion (Free & Paid) https://replicate.com/nightmareai/dis…
Cog View 2 (Free & Paid) https://replicate.com/thudm/cogview2
Pixray (Free & Paid) https://replicate.com/pixray/text2image
Hot Pot AI (free & Paid) https://hotpot.ai/art-maker
There are many other text to image generation tools and still more are being developed. The list above is just a small sample of the many tools available.
Understanding texts#
More than a decade ago, Carl Lewis stood on the threshold of what was
to become the greatest athletics career in history. He had just broken
two of the legendary Jesse Owens' college records, but never believed
he would become a corporate icon, the focus of hundreds of millions of
dollars in advertising. His sport was still nominally amateur.
Eighteen Olympic and World Championship gold medals and 21 world
records later, Lewis has become the richest man in the history of track
and field – a multi-millionaire.
Who is Carl Lewis?
Did Carl Lewis break any world records? (and how do you know that?)
Is Carl Lewis wealthy? What about Jesse Owens?
Carl Lewis is an athlete.
Yes, Carl Lewis broke two of Jesse Owens’ college records.
Yes, Carl Lewis is wealthy. The passage states that he is “the richest man in the history of track and field – a multi-millionaire.” There is no information given about Jesse Owens’ wealth.
The passage is discussing Carl Lewis, an athlete who at the beginning of his career broke two of the college records previously held by the legendary athlete Jesse Owens. Despite not expecting to become a corporate icon or the focus of advertising, Lewis went on to win 18 Olympic and World Championship gold medals and set 21 world records, becoming a multi-millionaire and the richest man in the history of track and field. The passage briefly mentions Jesse Owens, but does not provide information about his wealth.
What does it take to understand the text?#
Understanding text requires several skills and strategies, including:
Reading comprehension: This involves understanding the meaning of individual words and phrases, as well as the larger context and purpose of the text.
Inference: This involves using the information presented in the text to draw conclusions and make predictions about what might happen next.
Background knowledge: This involves using your own prior knowledge and experiences to make sense of the text and fill in any gaps in understanding.
Critical thinking: This involves analyzing the text and evaluating its credibility, identifying any biases or assumptions, and considering multiple perspectives.
In the example passage about Carl Lewis, the reader must use these skills and strategies to understand who he is, his athletic accomplishments, and his wealth. The reader must also infer information not explicitly stated, such as the fact that Jesse Owens is a legendary athlete and the significance of Lewis breaking his records. By applying these skills and strategies, the reader can gain a deeper understanding of the text and its meaning.
Why is NLP hard?#
NLP, or natural language processing, is a challenging field due to several reasons:
Ambiguity: Language is inherently ambiguous, and words or phrases can have multiple meanings depending on the context in which they are used. For example, the word “bank” can refer to a financial institution, a river bank, or the act of tilting or turning something. This ambiguity makes it difficult for NLP systems to accurately interpret a text.
Change: Language is constantly evolving and changing, with new words and phrases being added to the lexicon all the time. NLP systems need to be able to adapt to these changes and stay up-to-date with the latest developments in language use.
Complexity: Language is a complex system with many different levels of structure, from the smallest units of sound (phonemes) to the largest units of meaning (discourse). In order to accurately understand and generate language, NLP systems need to be able to decode the individual sounds and words, as well as the grammar and syntax. Additionally, understanding the intended message of a text requires an understanding of the cultural and historical context in which it was produced.
These challenges make NLP a difficult and constantly evolving field, with new techniques and approaches being developed all the time to improve the accuracy and effectiveness of NLP systems.
Ambiguity#
One of the main challenges in NLP is ambiguity, which can arise at multiple levels:
Word sense: Words can have multiple meanings, such as “bank” referring to a financial institution or a river bank.
Part of speech: The same word can function as different parts of speech, such as “chair” being both a noun and a verb.
Syntactic structure: The same words can be structured in different ways to convey different meanings, such as “I saw the man with the telescope” implying that the man had the telescope, while “I saw the man with the telescope” implies that the speaker had the telescope.
Multiple: Sentences can have multiple sources of ambiguity, such as “I saw her duck” being ambiguous between the bird or an action of lowering oneself.
NLP systems must be able to handle ambiguity in order to accurately understand and process natural language text. This requires sophisticated techniques and algorithms that can disambiguate the multiple possible meanings of a text based on its context and other linguistic features.
The sentence “I saw her duck” is ambiguous and has three possible meanings:
The first meaning is that the speaker saw a duck that belongs to her. In this case, “duck” is a noun and “her” is possessive, indicating ownership.
The second meaning is that the speaker saw her physically duck, meaning that she crouched or moved out of the way. In this case, “duck” is a verb.
The third meaning is a combination of the first two meanings, where the speaker saw her duck down or move out of the way of something. This is a combination of the noun and verb meanings.
This ambiguity is an example of how language is not always straightforward and can be difficult for NLP systems to accurately interpret.
Dealing with ambiguity#
How can we model ambiguity and choose correct analysis in context?
Non-probabilistic methods return all possible analyses.
Probabilistic models return best possible analysis, i.e., most probable one according to the model.
But the “best” analysis is only good if our probabilities are accurate. Where do they come from?
Dealing with ambiguity in NLP is a challenging task. One of the ways to approach this challenge is by modeling ambiguity and choosing the correct analysis in context. Non-probabilistic methods return all possible analyses, while probabilistic models return the best possible analysis, which is the most probable one according to the model. However, it’s important to note that the “best” analysis is only useful if the probabilities used in the model are accurate. Inaccurate probabilities may lead to incorrect analyses and conclusions.
To ensure accurate probabilities, NLP systems require large amounts of high-quality data to be trained on. The quality and size of the data set can significantly impact the accuracy of the model’s predictions. Additionally, the use of advanced techniques such as neural networks and deep learning algorithms can help improve the accuracy of probabilistic models. Another approach to dealing with ambiguity is by incorporating contextual information into the analysis. By taking into account the context in which a word or phrase is used, NLP systems can better determine the intended meaning and reduce ambiguity.
Corpora#
A corpus is a collection of text
Often annotated in some way
Sometimes just lots of raw text
Examples
Penn Treebank: 1M words of parsed Wall Street Journal
Canadian Hansards: 10M+ words of aligned French/English sentences
Yelp reviews
The Web / Common Crawl: billions of words of who knows what
Corpora#
Where do the probabilities for probabilistic models come from?
Probabilistic models rely on statistical analysis of large corpora of text to learn patterns and make predictions. The more data available, the more accurate the probabilities are likely to be. There are several different types of corpora that can be used to train NLP models, including:
General corpora: These contain a wide range of text, such as news articles, books, and web pages.
Domain-specific corpora: These contain text from a particular domain, such as legal documents or medical records.
Treebanks: These are annotated corpora that include information about the syntactic structure of sentences.
Parallel corpora: These contain translations of text from one language to another.
By analyzing the patterns in these corpora, NLP models can learn to recognize common phrases, identify parts of speech, and make predictions about the meaning of words and sentences. However, it is important to note that the accuracy of probabilistic models depends on the quality and size of the training data, as well as the complexity of the language being analyzed.
The eKorpkit Corpus#
The eKorpkit Corpus is a large, diverse, multilingual (ko/en) language modelling dataset. English: 258.83 GiB, Korean: 190.04 GiB, Total: 448.87 GiB
Sparsity#
In the study of natural language processing (NLP), the concept of “sparsity” is used to refer to the phenomenon that a large proportion of words in a corpus occur relatively infrequently, while only a small number of words appear with high frequency. This phenomenon is closely related to Zipf’s Law, which describes the distribution of word frequencies in a large text corpus.
Zipf’s Law describes the distribution of word frequencies in a large text corpus, stating that the frequency of any given word in a corpus is inversely proportional to its rank in the frequency table.
A small number of high-frequency words, such as “the,” “and,” “of,” and “to,” make up a significant proportion of the words in any given corpus, while the vast majority of words appear only a few times or even just once, leading to a long tail of infrequent words.
Sparsity can present challenges for NLP tasks such as text classification, information retrieval, and machine translation, which rely on statistical models that may struggle to accurately capture the meaning and usage of infrequent words.
Unmodeled variables#
In the context of natural language processing (NLP), “unmodeled variables” refer to the factors that may influence the meaning of a text but are not explicitly represented in a computational model or algorithm. One major source of unmodeled variables is world knowledge, which refers to the knowledge and understanding that humans possess about the world and how it works.
I dropped the glass on the floor, and it broke
I dropped the hammer on the glass, and it broke
The two sentences given as examples illustrate the importance of world knowledge in understanding the meaning of text. In the first sentence, “I dropped the glass on the floor, and it broke,” the reader can infer that the glass broke because it was dropped on a hard surface. This inference relies on the reader’s world knowledge about the properties of glass and the effects of dropping objects on hard surfaces.
In the second sentence, “I dropped the hammer on the glass, and it broke,” the reader can infer that the glass broke because it was struck by a hard object. This inference relies on the reader’s world knowledge about the properties of glass and the effects of being struck by hard objects.
The difference in meaning between the two sentences illustrates how unmodeled variables can influence the interpretation of text. In this case, the presence or absence of the word “floor” or “hammer” has a significant impact on the reader’s understanding of what caused the glass to break.
Unknown representation#
“Unknown representation” in natural language processing refers to the challenge of accurately representing the knowledge that humans have and need to communicate and understand language.
It is difficult to capture and represent human knowledge, including the meaning of words and sentences, context, and general knowledge.
The meaning of words and sentences is context-dependent, and can change over time or in different cultural contexts.
Modeling context accurately is a challenge in computational models, particularly with large and complex datasets.
General knowledge, including cultural, scientific, and technical knowledge, is vast and constantly evolving, making it difficult to represent in computational models.
To address this challenge, researchers have developed techniques such as deep learning algorithms, external knowledge sources, and human feedback to refine and improve computational models.
These approaches aim to develop more accurate and robust natural language processing systems that can capture and represent the knowledge needed for successful language communication and understanding.
Traditional NLP Pipelines#
Traditional NLP pipelines are a series of computational tasks that process natural language text in order to extract useful information and knowledge. These pipelines typically consist of several distinct stages, each of which is designed to perform a specific type of analysis on the input text.
The stages of a traditional NLP pipeline typically include:
Tokenizer/Segmenter: This stage is used to identify the boundaries between words and sentences in the input text. This can involve identifying spaces, punctuation marks, or other markers that indicate the end of a word or sentence.
Morphological analyzer/POS-tagger: This stage is used to identify the part of speech and structure of each word in the input text. This can involve identifying the inflectional endings, prefixes, and suffixes that modify the root form of a word, as well as determining its syntactic role in the sentence.
Word sense disambiguation: This stage is used to identify the correct meaning of ambiguous words in the input text. This can involve using contextual clues or external knowledge sources to disambiguate between multiple possible meanings of a word.
Syntactic/semantic Parser: This stage is used to obtain the structure and meaning of each sentence in the input text. This can involve identifying the grammatical relationships between words in the sentence, as well as their underlying semantic roles and relations.
Coreference resolution/discourse model: This stage is used to keep track of the various entities and events mentioned in the input text. This can involve identifying pronouns and other referring expressions that refer to previously mentioned entities, as well as building a coherent discourse model of the text as a whole.
Together, these stages enable NLP systems to extract useful information and knowledge from natural language text, and to perform a wide range of tasks such as information retrieval, machine translation, sentiment analysis, and text classification.
A simple NLP pipeline might include a tokenizer, a part-of-speech tagger, a dependency parser, an entity recognizer, and an output component. This type of pipeline is useful for tasks such as text classification and sentiment analysis.
Input → tokenizer → words → part-of-speech tagger → parts-of-speech → dependency parser → dependencies → entity recognizer → entities → output
The output tends to be a structured representation of the input, such as a parse tree or dependency structure.
A more complex NLP pipeline might include additional processing steps, such as a question-answerer that uses dependencies to identify entities in a question and retrieve an answer from a database. This type of pipeline is useful for more advanced tasks such as natural language understanding and machine translation.
Input → tokenizer → words → part-of-speech tagger → parts-of-speech → dependency parser → dependencies → entity recognizer → entities → question-answerer → answer → output
The question-answerer could use the dependencies to identify the subject, object, and other entities in the question and use this information to retrieve an answer from a database.
Traditional NLP systems often use pipeline-based approaches, where each processing step is implemented as a separate program. This approach offers flexibility and customization for each step, but can also lead to issues with accuracy and robustness, as errors or inaccuracies in one step can affect subsequent steps.
Assumptions of traditional NLP pipelines#
The NLP pipeline makes several assumptions about the nature of natural language and its structure. Each step in the pipeline is designed to provide explicit information about the linguistic structure of the input text. For example, POS tagging identifies the parts of speech of words, while syntactic parsing identifies the grammatical structure of sentences.
In order to achieve this, each step in the pipeline requires its own explicit output representation, which must be defined and designed according to linguistic principles. For example, POS tagging requires a POS tag set, which might include labels such as NN (common noun singular) or NNS (common noun plural), while syntactic parsing requires constituent or dependency labels, such as NP (noun phrase) or nsubj (nominal subject).
The representations used in the NLP pipeline must capture linguistically appropriate generalizations and abstractions. This requires linguistic expertise to design, as well as ongoing refinement and updating as linguistic knowledge and understanding evolves.
Overall, the NLP pipeline assumes that natural language has a structure that can be captured and analyzed computationally, and that this structure can be used to extract useful information and knowledge from text data. While this approach has been successful in many areas of NLP, it also has limitations and challenges, including the need for expert linguistic knowledge, the difficulty of handling unmodeled variables and unknown representations, and the challenges of accurately modeling context and discourse.
Shortcomings of traditional NLP pipelines#
While the NLP pipeline has been successful in many areas of natural language processing, it also has several shortcomings that limit its accuracy and effectiveness.
One major challenge is that each step in the pipeline relies on a learned model that will return the most likely representations. This requires a large amount of annotated training data for each step, which can be expensive and difficult to obtain, as human annotators are not always 100% accurate. Additionally, these models are never 100% accurate themselves, and they are more likely to make mistakes if their input contains errors or inaccuracies.
Another challenge is determining whether the representations used in the pipeline capture the “right” generalizations. While some representations may be easier to predict than others, and some may be more useful for subsequent steps in the pipeline, it is difficult to know how well a particular representation will work until it is incorporated into a model and tested.
Sidestepping the NLP pipeline#
Some current approaches to natural language understanding and generation aim to sidestep the traditional NLP pipeline by going directly from raw input to the desired final output. These approaches often use neural networks and deep learning algorithms to learn the underlying structure and patterns in the input data, without the need for explicit representations or processing steps.
With large amounts of training data, these approaches can often outperform traditional NLP systems. However, they still have limitations and challenges that need to be addressed. For example, incorporating knowledge and reasoning into these models can be difficult, as it requires capturing complex and nuanced relationships between entities and concepts. Additionally, these models may struggle when working with low-resource languages or domains, where training data is limited or scarce.
Neural Nets for NLP#
Neural networks have become increasingly popular in the field of natural language processing (NLP) due to their ability to learn from large amounts of data and capture complex patterns in natural language text.
A neural network is a type of machine learning algorithm that is inspired by the structure of the brain. It is composed of a series of interconnected processing nodes, or neurons, that can learn to recognize patterns of input data.
Some popular neural network models for NLP include:
Long Short-Term Memory (LSTM): a type of recurrent neural network (RNN) that can process sequences of input data and capture long-term dependencies between words or tokens.
Gated Recurrent Unit (GRU): a variant of LSTM that uses fewer parameters and is faster to train.
Transformers: a type of neural network architecture that uses self-attention mechanisms to model the relationships between words or tokens in a sequence.
Bidirectional Encoder Representations from Transformers (BERT): a transformer-based model that can perform tasks such as language modeling, named entity recognition, and sentiment analysis.
Generative Adversarial Networks (GANs): a type of neural network that can generate new text data based on a training dataset.
Question Answering on SQuAD 1.1#
With respect to the SQuAD dataset, human performance is quite high, with a F1 score of around 92% on the test set.
NLP in the Korean Language#
Natural Language Processing (NLP) in the Korean language presents several unique challenges due to the language’s characteristics. For example, the verb “to eat” in Korean is “밥을 먹다” (bapeul meokda). The inflectional ending “-을” (eul) indicates that the verb is being used in the object position, and the noun “밥” (bap) means “rice” or “meal.” This example illustrates the complexity of Korean’s inflectional system, which can make it difficult for NLP systems to accurately identify the meaning of words and sentences.
Korean also has a highly complex grammar system, including the use of honorifics to indicate social status and respect. For example, the verb “to give” in Korean can be conjugated differently depending on the level of formality or politeness required in the situation. This can make it challenging for NLP systems to accurately identify the intended meaning of words and sentences.
In addition, Korean has a high degree of lexical ambiguity, with many words having multiple meanings depending on the context. For example, the word “사과” (sagwa) in Korean can mean either “apple” or “apology,” depending on the context in which it is used. This ambiguity can further complicate the task of identifying the intended meaning of words and sentences using NLP techniques.
Despite these challenges, there are ongoing efforts to develop NLP tools and resources for Korean, including annotated datasets, pre-trained models, and language processing libraries. However, Korean text data is less abundant compared to languages such as English, and there are fewer tools and resources available for NLP in Korean. Additionally, the spacing in Korean text can be arbitrary, especially in informal or colloquial contexts, making it challenging to accurately identify word boundaries and segment text into meaningful units for NLP tasks.
Why NLP is more difficult in Korean?#
NLP is considered more difficult in Korean due to several unique characteristics of the language. While Korean is not considered a low-resource language, it presents challenges for NLP due to the following factors:
Complex grammar: Korean has a highly complex grammar system, including the use of honorifics to indicate social status and respect. This complexity can make it difficult to identify the underlying structure of sentences.
Ambiguity: Korean has a high degree of lexical ambiguity, with many words having multiple meanings depending on the context. This can make it challenging to identify the intended meaning of words and sentences accurately.
Arbitrary spacing: While Korean does use spacing between words, in informal or colloquial contexts, the spacing can be arbitrary and may not strictly follow the linguistic rules or conventions. This can make it difficult to accurately identify word boundaries and segment text into meaningful units for NLP tasks.
Limited training data: Compared to languages such as English, Korean has less training data available for NLP tasks, which can make it challenging to develop accurate and robust models.
Limited resources: There are fewer tools and resources available for NLP in Korean, such as annotated datasets and pre-trained models. This can make it more challenging to develop and deploy NLP systems.
NLP in Low-Resource Languages#
NLP in low-resource languages can be particularly challenging due to the lack of linguistic resources, including annotated datasets, pre-trained models, and language processing libraries. Low-resource languages typically have fewer speakers and are often underrepresented in digital text, making it difficult to collect sufficient data to train NLP models.
This lack of data can lead to poor performance when using standard NLP techniques that rely on large amounts of annotated training data. For example, traditional NLP techniques such as rule-based systems, statistical machine translation, and language models may not be effective in low-resource languages due to the lack of data available to train these models.
To address these challenges, researchers are exploring new approaches to NLP in low-resource languages, such as transfer learning, unsupervised learning, and data augmentation. Transfer learning, in particular, has shown promising results in low-resource settings, where models pre-trained on high-resource languages can be fine-tuned on smaller datasets in low-resource languages.
In addition to technical challenges, NLP in low-resource languages also raises ethical considerations. For example, NLP systems trained on small, unrepresentative datasets may reinforce biases and stereotypes present in the data, leading to discriminatory outcomes. Therefore, it is essential to ensure that NLP systems are developed and evaluated with a critical lens, taking into account the social and cultural contexts in which they will be used.
Overall, NLP in low-resource languages is an active area of research, with ongoing efforts to develop tools, resources, and techniques to overcome the challenges posed by the lack of linguistic resources. By leveraging new approaches and ensuring that NLP systems are developed ethically and responsibly, we can help promote linguistic diversity and empower communities who speak low-resource languages.