Lab: Korean Text Processing#
In this lab, we’ll practice Korean text processing using two important libraries: eKoNLPy and KSS. The first library, eKoNLPy, is used for tokenizing and tagging Korean text, while the second library, KSS, is used for sentence segmentation.
I. Setup#
Install the necessary libraries by executing the following command:
%pip install ekonlpy
%pip install kss
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ekonlpy in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (1.1.7)
Requirement already satisfied: nltk<4.0.0,>=3.8.1 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from ekonlpy) (3.8.1)
Requirement already satisfied: scipy<2.0.0,>=1.10.1 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from ekonlpy) (1.10.1)
Requirement already satisfied: fugashi<2.0.0,>=1.2.1 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from ekonlpy) (1.2.1)
Requirement already satisfied: pandas<2.0.0,>=1.5.3 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from ekonlpy) (1.5.3)
Requirement already satisfied: mecab-ko-dic<2.0.0,>=1.0.0 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from ekonlpy) (1.0.0)
Requirement already satisfied: click in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from nltk<4.0.0,>=3.8.1->ekonlpy) (8.1.3)
Requirement already satisfied: regex>=2021.8.3 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from nltk<4.0.0,>=3.8.1->ekonlpy) (2023.3.23)
Requirement already satisfied: tqdm in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from nltk<4.0.0,>=3.8.1->ekonlpy) (4.65.0)
Requirement already satisfied: joblib in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from nltk<4.0.0,>=3.8.1->ekonlpy) (1.2.0)
Requirement already satisfied: pytz>=2020.1 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from pandas<2.0.0,>=1.5.3->ekonlpy) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from pandas<2.0.0,>=1.5.3->ekonlpy) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from pandas<2.0.0,>=1.5.3->ekonlpy) (1.23.5)
Requirement already satisfied: six>=1.5 in /home/yjlee/.cache/pypoetry/virtualenvs/lecture-_dERj_9R-py3.8/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas<2.0.0,>=1.5.3->ekonlpy) (1.16.0)
[notice] A new release of pip is available: 23.0 -> 23.1.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
II. eKoNLPy#
eKoNLPy is a Python library for Korean natural language processing. It provides functions for tokenizing Korean text into words, part-of-speech tagging, and morpheme analysis.
Example 1: Part-of-Speech Tagging with eKoNLPy#
from ekonlpy.mecab import MeCab
mecab = MeCab()
print(mecab.pos("금통위는 따라서 물가안정과 병행, 경기상황에 유의하는 금리정책을 펼쳐나가기로 했다고 밝혔다."))
[('금', 'MAJ'), ('통', 'MAG'), ('위', 'NNG'), ('는', 'JX'), ('따라서', 'MAJ'), ('물가', 'NNG'), ('안정', 'NNG'), ('과', 'JC'), ('병행', 'NNG'), (',', 'SC'), ('경기', 'NNG'), ('상황', 'NNG'), ('에', 'JKB'), ('유의', 'NNG'), ('하', 'XSV'), ('는', 'ETM'), ('금리', 'NNG'), ('정책', 'NNG'), ('을', 'JKO'), ('펼쳐', 'VV+EC'), ('나가', 'VX'), ('기', 'ETN'), ('로', 'JKB'), ('했', 'VV+EP'), ('다고', 'EC'), ('밝혔', 'VV+EP'), ('다', 'EF'), ('.', 'SF')]
The output of the code above will be a list of tuples, where the first element of each tuple is a word and the second element is the corresponding part-of-speech tag.
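Since the tagged result is just a list of `(word, tag)` tuples, standard list operations apply. As a small sketch, we can filter out the common nouns (tag `NNG`); the `tagged` list below is abbreviated from the output above:

```python
# A few (word, tag) pairs abbreviated from the mecab.pos(...) output above.
tagged = [('금', 'MAJ'), ('통', 'MAG'), ('위', 'NNG'), ('는', 'JX'),
          ('물가', 'NNG'), ('안정', 'NNG'), ('금리', 'NNG'), ('정책', 'NNG')]

# Keep only tokens tagged as common nouns (NNG).
nouns = [word for word, tag in tagged if tag == 'NNG']
print(nouns)  # ['위', '물가', '안정', '금리', '정책']
```

This kind of tag-based filtering is a common first step before frequency counting or keyword extraction.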
Example 2: Adding Words to Dictionary#
from ekonlpy.tag import Mecab
mecab = Mecab()
mecab.add_dictionary("금통위", "NNG")
print(mecab.pos("금통위는 따라서 물가안정과 병행, 경기상황에 유의하는 금리정책을 펼쳐나가기로 했다고 밝혔다."))
[('금통위', 'NNG'), ('는', 'JX'), ('따라서', 'MAJ'), ('물가', 'NNG'), ('안정', 'NNG'), ('과', 'JC'), ('병행', 'NNG'), (',', 'SC'), ('경기', 'NNG'), ('상황', 'NNG'), ('에', 'JKB'), ('유의', 'NNG'), ('하', 'XSV'), ('는', 'ETM'), ('금리', 'NNG'), ('정책', 'NNG'), ('을', 'JKO'), ('펼쳐', 'VV'), ('나가', 'VX'), ('기', 'ETN'), ('로', 'JKB'), ('했', 'VV'), ('다고', 'EC'), ('밝혔', 'VV'), ('다', 'EF'), ('.', 'SF')]
With this method, you can extend the user dictionary by adding words with their corresponding part-of-speech tags.
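To confirm that a dictionary addition took effect, you can check whether the new term now surfaces as a single token in the tagged output. The helper below is a sketch (not part of eKoNLPy); the token lists are abbreviated from the two outputs above:

```python
# Hypothetical helper: does `term` appear as one token in a (word, tag) list?
def is_single_token(term, tagged):
    return any(word == term for word, _ in tagged)

# Abbreviated from the outputs before and after add_dictionary("금통위", "NNG").
before = [('금', 'MAJ'), ('통', 'MAG'), ('위', 'NNG'), ('는', 'JX')]
after = [('금통위', 'NNG'), ('는', 'JX')]

print(is_single_token('금통위', before))  # False
print(is_single_token('금통위', after))   # True
```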
III. KSS (Korean Sentence Splitter)#
KSS is a Python library for splitting Korean text into sentences. Rather than splitting naively on punctuation, it uses linguistically informed heuristics (backed by morpheme analysis) to segment Korean text into sentences accurately.
Example 3: Sentence Segmentation with KSS#
import kss
text = """
일본기상청과 태평양지진해일경보센터는 3월 11일 오후 2시 49분경에 일본 동해안을 비롯하여 대만, 알래스카, 하와이, 괌,
캘리포니아, 칠레 등 태평양 연안 50여 국가에 지진해일 주의보와 경보를 발령하였다. 다행히도 우리나라는 지진발생위치로부터
1,000km 이상 떨어진데다 일본 열도가 가로막아 지진해일이 도달하지 않았다. 지진해일은 일본 소마항에 7.3m, 카마이시항에 4.1m,
미야코항에 4m 등 일본 동해안 전역에서 관측되었다. 지진해일이 원해로 전파되면서 대만(19시 40분)에서 소규모 지진해일과 하와이
섬에서 1.4m(23시 9분)의 지진해일이 관측되었다. 다음날인 3월 12일 새벽 1시 57분경에는 진앙지로부터 약 7,500km 떨어진
캘리포니아 크레센트시티에서 2.2m의 지진해일이 관측되었다.
"""
print(kss.split_sentences(text))
The output of the code above will be a list of sentences split from the input text.
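To see why a dedicated splitter is useful, consider a naive baseline that splits wherever sentence-final punctuation is followed by whitespace. This sketch (not part of KSS) handles the simple case, but it only survives the `7.3m` inside the sentence because of the whitespace check, and it would miss sentences that end in a Korean verb ending without any punctuation at all, which KSS is designed to handle:

```python
import re

# Naive baseline: split after ., !, or ? when followed by whitespace.
def naive_split(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

sample = "경보를 발령하였다. 지진해일은 소마항에 7.3m로 관측되었다."
print(naive_split(sample))
# ['경보를 발령하였다.', '지진해일은 소마항에 7.3m로 관측되었다.']
```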
Other Korean Tokenizers#
There are many other Korean tokenizers available, each with its own strengths and weaknesses. Several of them are available through the konlpy library.
Here’s a brief summary of each:
Kkma: Developed by Seoul National University’s Intelligent Data Systems lab, this morpheme analyzer uses dynamic programming. It’s based on Java, and can be slow.
Komoran: Developed by Shineware, this Java-based library is unique in that it can analyze multiple words as a single part of speech, making it effective for analyzing proper nouns that include spaces.
Mecab: This is a version of a Japanese morpheme analyzer that has been adapted for Korean. It supports adding user dictionaries to recognize new words from different domains.
Hannanum: Developed by the KAIST Semantic Web Research Center, this analyzer provides automatic spacing and spell correction based on its morpheme analysis.
Open Korean Text (Okt): Formerly known as the Twitter morpheme analyzer, it’s good at recognizing names, neologisms, and other language trends on social media. It’s fast but its morpheme analysis quality is relatively low.
All these tools, including eKoNLPy, contribute to the field of Korean NLP by providing methods for tokenizing text, tagging parts of speech, and analyzing the morphological structure of Korean text.
Kkma#
from konlpy.tag import Kkma
kkma = Kkma()
sentence = "안녕하세요. 저는 AI 모델입니다."
print(kkma.morphs(sentence))
Komoran#
from konlpy.tag import Komoran
komoran = Komoran()
sentence = "안녕하세요. 저는 AI 모델입니다."
print(komoran.morphs(sentence))
Mecab#
To use Mecab through konlpy, you need a MeCab installation with a Korean dictionary (for example, mecab-ko and mecab-ko-dic) in addition to the konlpy and mecab-python3 packages.
from konlpy.tag import Mecab
mecab = Mecab()
sentence = "안녕하세요. 저는 AI 모델입니다."
print(mecab.morphs(sentence))
Hannanum#
from konlpy.tag import Hannanum
hannanum = Hannanum()
sentence = "안녕하세요. 저는 AI 모델입니다."
print(hannanum.morphs(sentence))
Open Korean Text (Okt)#
from konlpy.tag import Okt
okt = Okt()
sentence = "안녕하세요. 저는 AI 모델입니다."
print(okt.morphs(sentence))
Each of these tokenizers takes an input sentence and returns a list of morphemes via the morphs() method. Remember that they may return different results due to differences in their algorithmic approaches.