Preparing training datasets#
from ekorpkit import eKonf
if eKonf.is_colab():
    eKonf.mount_google_drive()
ws = eKonf.set_workspace(
    workspace="/workspace",
    project="ekorpkit-book/exmaples",
    task="esg",
    log_level="INFO"
)
print("version:", ws.version)
print("project_dir:", ws.project_dir)
INFO:ekorpkit.base:Set environment variable EKORPKIT_DATA_ROOT=/workspace/data
INFO:ekorpkit.base:Set environment variable CACHED_PATH_CACHE_ROOT=/workspace/.cache/cached_path
version: 0.1.40.post0.dev57
project_dir: /workspace/projects/ekorpkit-book/exmaples
time: 955 ms (started: 2022-12-16 04:03:41 +00:00)
Fetch the labeled datasets from the Label Studio server#
from ekorpkit.io.fetch.labelstudio import LabelStudio
ls = LabelStudio()
time: 644 ms (started: 2022-12-16 04:04:13 +00:00)
project_list = ls.list_projects(verbose=True)
13: ESG Topic Classification (Sep 2022)
12: ESG Polarity Classification (Sep 2022)
3: ESG Topic Classification
2: ESG Polarity Classification
time: 1.12 s (started: 2022-12-16 04:04:20 +00:00)
ls.name = "esg_polarity_labels"
esg_polarity_data = ls.export_annotations(project_id=12)
print(esg_polarity_data.shape)
esg_polarity_data.head()
INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/12/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/configs/esg_polarity_labels(2)_config.yaml
(1158, 6)
| | id | text | annot_id | annotator | origin | labels |
---|---|---|---|---|---|---|
0 | 97756 | 세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ... | 20328 | 3 | prediction | Neutral |
1 | 97667 | 美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열... | 20327 | 3 | prediction | Neutral |
2 | 97650 | [머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이... | 20326 | 3 | prediction | Neutral |
3 | 97627 | 영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공... | 20325 | 3 | prediction | Neutral |
4 | 97593 | 경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹... | 20324 | 3 | prediction | Neutral |
time: 6.1 s (started: 2022-12-16 04:04:22 +00:00)
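The log lines above show the request that the wrapper issues: a GET on `/api/projects/12/export` with `exportType=JSON`. As a rough sketch of the same request outside ekorpkit, the snippet below builds it with the standard library only and does not send it. The function name `build_export_request` and the token value are hypothetical, and the `Authorization: Token ...` header is an assumption about how the server authenticates.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_export_request(base_url: str, project_id: int, token: str) -> Request:
    """Build (but do not send) a Label Studio export request."""
    query = urlencode({"exportType": "JSON"})
    url = f"{base_url.rstrip('/')}/api/projects/{project_id}/export?{query}"
    # Assumption: the server accepts token authentication in this header.
    return Request(url, headers={"Authorization": f"Token {token}"})


req = build_export_request("http://ekorpkit-labelstudio:8080", 12, "YOUR_API_TOKEN")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would download the JSON export, provided the host is reachable and the token is valid.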
ls.name = "esg_topic_labels"
esg_topic_data = ls.export_annotations(project_id=13)
print(esg_topic_data.shape)
esg_topic_data.head()
INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/13/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/configs/esg_topic_labels(2)_config.yaml
(576, 6)
| | id | text | annot_id | annotator | origin | labels |
---|---|---|---|---|---|---|
0 | 201831 | 지역>충남 \| 경제>금융_재테크 \| 지역>울산 [머니투데이 구경민 기자] 미래에셋... | 20337 | 1 | prediction | S-사회공헌 |
1 | 201600 | 패밀리 오피스 분야 전락적 MOU 이창헌(앞줄 오른쪽) 한국M&A거래소 회장과 류... | 20336 | 1 | prediction | UNKNOWN |
2 | 201319 | 01%)를 미래에셋증권과 이음프라이빗에쿼티(PE) 컨소시엄에 4500억원에 매각하기... | 20335 | 1 | prediction | UNKNOWN |
3 | 200829 | 다이렉트 IRP 운용·자산관리 수수료 전부 면제를 통한 비용부담 해소 은행 및 보... | 20334 | 1 | prediction | UNKNOWN |
4 | 199893 | 1999년 12월 자본금 500억원에 설립된 미래에셋증권은 약 20년 만에 200배... | 20333 | 1 | prediction | G-지배구조 |
time: 3.42 s (started: 2022-12-16 04:04:28 +00:00)
Snorkel LabelModel#
from ekorpkit.tasks.label.snorkel import BaseSnorkel
snorkel = BaseSnorkel(name="esg_polarity_snorkel")
INFO:ekorpkit.config:Init batch - Batch name: esg_polarity_snorkel, Batch num: 2
time: 640 ms (started: 2022-12-14 11:40:17 +00:00)
data_file = "/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet"
snorkel.load_datasets(data_files=data_file, test_split_ratio=0.2, seed=12345)
INFO:ekorpkit.io.file:Processing [1] files from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet
INFO:ekorpkit.datasets.config:Splitting the dataframe into train and test with ratio 0.2
INFO:ekorpkit.datasets.config:Train data: (926, 7), Test data: (232, 7)
INFO:ekorpkit.datasets.config:Shuffling the dataframe with seed 12345
INFO:ekorpkit.datasets.config:Train data: (926, 7)
INFO:ekorpkit.datasets.config:Test data: (232, 7)
INFO:ekorpkit.io.file:Concatenating 2 dataframes
time: 23.6 ms (started: 2022-12-14 11:40:18 +00:00)
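The logged sizes (926 train, 232 test out of 1158 rows) are consistent with holding out 20% of the rows and rounding the test share up. A minimal standard-library sketch of a seeded shuffle-and-split follows. The helper `shuffle_split` is hypothetical, and ekorpkit's actual order of operations and rounding rule may differ.

```python
import math
import random


def shuffle_split(rows, test_ratio, seed):
    """Shuffle a copy of rows with a fixed seed, then hold out a test slice."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Assumption: the test share is rounded up to a whole number of rows.
    n_test = math.ceil(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


train, test = shuffle_split(range(1158), test_ratio=0.2, seed=12345)
print(len(train), len(test))  # 926 232
```

Fixing the seed makes the split reproducible, so the same rows land in the test set on every run.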
Writing Labeling Functions#
Each crowdworker can be treated as a single labeling function: a worker labels only a subset of the data points, and may make errors or disagree with other workers (that is, with other labeling functions). A worker's labeling function simply returns the label that worker submitted for a given text, and abstains if the worker did not submit one.
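The idea can be sketched without any library: each worker becomes a lookup from text id to submitted label, with a fallback for texts the worker never saw. The annotation records below are made up for illustration, and `ABSTAIN = -1` follows Snorkel's usual convention.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"

# Hypothetical annotations: (text id, annotator id, class id)
annotations = [
    (101, 2, 1),
    (102, 2, 2),
    (103, 3, 1),
]


def make_worker_lf(worker_id, records):
    """Return a function mapping a text id to this worker's label."""
    submitted = {tid: cls for tid, ann, cls in records if ann == worker_id}

    def worker_lf(text_id):
        # Abstain when the worker did not label this text.
        return submitted.get(text_id, ABSTAIN)

    return worker_lf


lf_2 = make_worker_lf(2, annotations)
lf_3 = make_worker_lf(3, annotations)
print(lf_2(101), lf_2(103), lf_3(103))  # 1 -1 1
```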
Crowdworker labeling functions#
eKonf.viewsource(snorkel.compose_worker_lfs)
def compose_worker_lfs(self):
    labels_by_annotator = self.data.groupby(self.columns.annotator)
    worker_dicts = {}
    for worker_id in labels_by_annotator.groups:
        worker_df = labels_by_annotator.get_group(worker_id)
        worker_dicts[worker_id] = dict(zip(worker_df.id, worker_df.classes))
    log.info(f"Number of workers: {len(worker_dicts)}")
    def worker_lf(x, worker_dict):
        return worker_dict.get(x.id, self.ABSTAIN)
    def make_worker_lf(worker_id):
        worker_dict = worker_dicts[worker_id]
        name = f"worker_{worker_id}"
        return LabelingFunction(
            name, f=worker_lf, resources={"worker_dict": worker_dict}
        )
    worker_lfs = [make_worker_lf(worker_id) for worker_id in worker_dicts]
    self.__worker_lfs__ = worker_lfs
    return worker_lfs
time: 1.34 ms (started: 2022-12-13 10:56:35 +00:00)
snorkel.data
INFO:ekorpkit.io.file:Concatenating 2 dataframes
| | id | text | annot_id | annotator | origin | labels | classes |
---|---|---|---|---|---|---|---|
0 | 99621 | 31일 김동관 한화솔루션 전략부문 대표는 P4G 기본세션 에너지부문 '더 푸르른 지... | 19868 | 3 | prediction | Neutral | 1 |
1 | 90918 | 김 변호사는 2001년 LG화학이 화학분야 중간지주회사인 LG CI를 인적 분할할 ... | 18929 | 2 | prediction | Neutral | 1 |
2 | 108056 | [머니투데이 김성은 기자] 한화솔루션의 그린에너지 사업부문인 한화큐셀이 글로벌 기... | 19926 | 3 | prediction | Positive | 2 |
3 | 90740 | 2차전지 분야에서도 LG화학(현재는 LG에너지솔루션으로 분사)의 법인세 부담률은 ... | 18927 | 2 | prediction | Neutral | 1 |
4 | 93872 | 김동관, P4G 정상회의 기조연설에서 기업 역할 강조 [아시아경제 황윤주 기자] ... | 19823 | 3 | prediction | Positive | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
1153 | 102985 | MIT 등 주요 10여개 대학 석·박사 및 학부생 대상 신학철 LG화학(05191... | 19106 | 2 | prediction | Neutral | 1 |
1154 | 108198 | 18일 금융투자 업계에 따르면 LG화학은 17일 오후 4시 주주와 투자자를 대상으로... | 19198 | 2 | prediction | Neutral | 1 |
1155 | 89454 | LG화학은 오는 16일까지 중국 선전(深圳)에서 열리는 '차이나플라스 2021'에... | 18904 | 2 | prediction | Positive | 2 |
1156 | 128683 | [아시아경제 황윤주 기자] LG화학의 유럽 폴란드 공장이 지속가능경영의 모범사례로... | 19628 | 2 | prediction | Positive | 2 |
1157 | 109789 | 한화솔루션은 오는 7월부터 2년 간 총 48t(톤)의 수소를 공급한다 또 이후 충... | 19939 | 3 | prediction | Positive | 2 |
1158 rows × 7 columns
time: 8.95 ms (started: 2022-12-14 11:41:05 +00:00)
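The `classes` column holds an integer code for each string label. In the rows shown, Neutral is 1 and Positive is 2. One mapping consistent with this assigns codes in alphabetical order of the labels, which would make Negative 0. That is an inference from the visible rows, not something the output confirms. A small sketch under that assumption:

```python
labels = ["Neutral", "Positive", "Neutral", "Negative", "Positive"]

# Assumption: class ids follow the sorted order of the distinct labels.
label_to_class = {label: i for i, label in enumerate(sorted(set(labels)))}
classes = [label_to_class[label] for label in labels]

print(label_to_class)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
print(classes)         # [1, 2, 1, 0, 2]
```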
worker_lfs = snorkel.compose_worker_lfs()
INFO:ekorpkit.tasks.label.snorkel:Number of workers: 3
time: 3.54 ms (started: 2022-12-13 10:56:38 +00:00)
snorkel.apply_worker_lfs(worker_lfs)
INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to train data
100%|██████████| 926/926 [00:00<00:00, 34148.60it/s]
INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to test data
100%|██████████| 232/232 [00:00<00:00, 33436.83it/s]
time: 40.2 ms (started: 2022-12-13 10:56:39 +00:00)
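Applying the labeling functions yields a label matrix with one row per text and one column per worker, where each entry is a class id or an abstain marker. The sketch below builds such a matrix from plain dictionaries. The worker votes and the helper `apply_lfs` are hypothetical, and this is not ekorpkit's or Snorkel's implementation.

```python
ABSTAIN = -1

# Hypothetical per-worker lookups: text id -> class id
worker_votes = {
    "worker_2": {101: 1, 102: 2},
    "worker_3": {103: 1},
}
text_ids = [101, 102, 103, 104]


def apply_lfs(ids, votes_by_worker):
    """Build a label matrix: one row per text, one column per worker."""
    workers = sorted(votes_by_worker)
    return [
        [votes_by_worker[w].get(tid, ABSTAIN) for w in workers]
        for tid in ids
    ]


L = apply_lfs(text_ids, worker_votes)
for row in L:
    print(row)
```

Text 104 was labeled by nobody, so its row contains only abstains.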
summary = snorkel.lf_summary()
summary
INFO:ekorpkit.tasks.label.snorkel:Training set coverage: 100.0%
| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
---|---|---|---|---|---|---|---|---|
worker_1 | 0 | [0, 1, 2] | 0.012959 | 0.0 | 0.0 | 12 | 0 | 1.0 |
worker_2 | 1 | [0, 1, 2] | 0.588553 | 0.0 | 0.0 | 545 | 0 | 1.0 |
worker_3 | 2 | [0, 1, 2] | 0.398488 | 0.0 | 0.0 | 369 | 0 | 1.0 |
time: 13.8 ms (started: 2022-12-13 10:56:40 +00:00)
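In the summary above, the three coverage values add up to 1.0 and every overlap is 0.0, which suggests that each training text was labeled by exactly one worker. To make the columns concrete, here is a small sketch that computes coverage, overlap and conflict rates from a label matrix, all as fractions of the whole dataset. The helper `lf_stats` and the toy matrices are illustrative only.

```python
ABSTAIN = -1


def lf_stats(L):
    """Per-LF (coverage, overlap, conflict) rates for a label matrix L."""
    n, m = len(L), len(L[0])
    stats = []
    for j in range(m):
        covered = overlap = conflict = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Non-abstain votes cast by the other labeling functions
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            overlap += bool(others)
            conflict += any(v != row[j] for v in others)
        stats.append((covered / n, overlap / n, conflict / n))
    return stats


# Each text labeled by exactly one worker
disjoint = [[1, ABSTAIN], [2, ABSTAIN], [ABSTAIN, 1], [ABSTAIN, 0]]
print(lf_stats(disjoint))  # [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]

# Two workers label the first text and disagree
mixed = [[1, 2], [2, ABSTAIN], [ABSTAIN, 1], [ABSTAIN, 0]]
print(lf_stats(mixed))  # [(0.5, 0.25, 0.25), (0.75, 0.25, 0.25)]
```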
Train LabelModel And Generate Probabilistic Labels#
snorkel.fit()
100%|██████████| 100/100 [00:01<00:00, 86.08epoch/s]
time: 1.19 s (started: 2022-12-13 10:56:43 +00:00)
snorkel.eval()
LabelModel Accuracy for train: 1.000
LabelModel Accuracy for test: 1.000
time: 5.52 ms (started: 2022-12-13 10:56:45 +00:00)
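Because the worker labeling functions in this run never overlap, each text carries a single vote, and any reasonable aggregation returns that vote. This is one plausible reading of the 1.000 accuracy reported above. As a point of comparison, the sketch below implements plain majority voting, a simpler baseline than Snorkel's LabelModel and not a substitute for it. The helper name and the toy matrix are hypothetical.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Most common non-abstain label in a row; ABSTAIN if none or tied."""
    counts = Counter(v for v in row if v != ABSTAIN).most_common(2)
    if not counts:
        return ABSTAIN
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between the top two labels
    return counts[0][0]


L = [[1, ABSTAIN, ABSTAIN], [ABSTAIN, 2, ABSTAIN], [1, 2, 2], [0, 1, ABSTAIN]]
print([majority_vote(row) for row in L])  # [1, 2, 2, -1]
```

The first two rows mimic the non-overlapping case: with only one vote present, the aggregate is that vote.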
preds = snorkel.predict()
100%|██████████| 1158/1158 [00:00<00:00, 37554.83it/s]
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/configs/esg_polarity_snorkel(1)_config.yaml
time: 581 ms (started: 2022-12-13 10:56:48 +00:00)
snorkel.save_preds(preds, columns=["id", "text", "labels"])
eKonf.load_data("esg_polarity_snorkel(0)_preds.parquet", snorkel.batch_dir)
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet
| | id | text | labels |
---|---|---|---|
0 | 97756 | 세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ... | Neutral |
1 | 97667 | 美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열... | Neutral |
2 | 97650 | [머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이... | Neutral |
3 | 97627 | 영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공... | Neutral |
4 | 97593 | 경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹... | Neutral |
... | ... | ... | ... |
1153 | 88550 | 산업통상자원부는 현대글로비스와 LG화학의 전기차 택시에 대한 배터리 대여사업을 승인... | Neutral |
1154 | 88439 | [헤럴드경제 정세희 기자]신학철 LG화학 부회장은 “동시대를 앞서나가는 디지털 전... | Positive |
1155 | 88413 | 갈등의 골이 깊어지면서 '시계제로' 상태에 빠진 아시아나항공 인수전이 무산될 가능성... | Negative |
1156 | 88407 | 그간 경남도와 창원시는 주력산업인 제조업의 성장이 둔화하면서 어려운 시기를 겪었으나... | Positive |
1157 | 88382 | 이와 함께 지난 7일 한샘은 삼성전자와 전략적 사업협력 업무협약(MOU)을 체결한... | Neutral |
1158 rows × 3 columns
time: 140 ms (started: 2022-12-13 10:56:54 +00:00)
Build a dataset using the data generated by the label model#
cfg = eKonf.compose("dataset=dataset_build")
cfg.name = "esg_polarity_kr"
data_dir = "../data/esg"  # assumed value, matching the paths in the log output below
cfg.data_dir = data_dir
cfg.data_file = "esg_polarity_snorkel_data.parquet"
cfg.force.build = True
cfg.pipeline.split_sampling.stratify_on = "labels"
cfg.pipeline.split_sampling.random_state = 123
cfg.pipeline.split_sampling.test_size = 0.2
cfg.pipeline.split_sampling.dev_size = 0.2
cfg.pipeline.reset_index.drop_index = True
cfg.verbose = False
esg_polarity_ds = eKonf.instantiate(cfg)
esg_polarity_ds.persist()
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('load_dataframe', 'load_dataframe'), ('reset_index', 'reset_index'), ('split_sampling', 'split_sampling')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function load_dataframe at 0x7f553803bf70>)
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_snorkel_data.parquet
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7f553803b1f0>)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function split_sampling at 0x7f5538032dc0>)
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.datasets.base:Dataset esg_polarity_kr built with 13616 rows
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'labels'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [dev]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [test]
INFO:ekorpkit.base:Using batcher with minibatch size: 38
INFO:ekorpkit.utils.batch.batcher: backend: joblib minibatch_size: 38 procs: 230 input_split: False merge_output: True len(data): 8713 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.276040
INFO:ekorpkit.base:Using batcher with minibatch size: 10
INFO:ekorpkit.utils.batch.batcher: backend: joblib minibatch_size: 10 procs: 230 input_split: False merge_output: True len(data): 2179 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.250046
INFO:ekorpkit.base:Using batcher with minibatch size: 12
INFO:ekorpkit.utils.batch.batcher: backend: joblib minibatch_size: 12 procs: 230 input_split: False merge_output: True len(data): 2724 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.253649
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
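The logged split sizes (8713 train, 2179 dev, 2724 test out of 13616 rows) are consistent with cutting 20% for test first, then 20% of the remainder for dev, rounding up each time. The sketch below checks that arithmetic and shows one simple way to hold out a fixed fraction of every label, which is what stratifying on `labels` aims for. Both helpers are hypothetical, and ekorpkit's `split_sampling` pipe may be implemented differently.

```python
import math
import random
from collections import defaultdict


def split_sizes(n, test_size, dev_size):
    """Sizes when test is cut first and dev is cut from the remainder."""
    n_test = math.ceil(n * test_size)
    n_dev = math.ceil((n - n_test) * dev_size)
    return n - n_test - n_dev, n_dev, n_test


print(split_sizes(13616, 0.2, 0.2))  # (8713, 2179, 2724)


def stratified_holdout(labels, frac, seed):
    """Return (kept, held_out) index lists, holding out frac of each label."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    rng = random.Random(seed)
    kept, held = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        k = math.ceil(len(idx) * frac)  # per-label holdout count
        held.extend(idx[:k])
        kept.extend(idx[k:])
    return sorted(kept), sorted(held)


labels = ["Neutral"] * 10 + ["Positive"] * 5
kept, held = stratified_holdout(labels, frac=0.2, seed=123)
print(len(kept), len(held))  # 12 3
```

Holding out per label keeps the class proportions of the held-out set close to those of the full data, which matters when some labels are rare.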