Preparing training datasets


from ekorpkit import eKonf

if eKonf.is_colab():
    eKonf.mount_google_drive()
ws = eKonf.set_workspace(
    workspace="/workspace", 
    project="ekorpkit-book/exmaples", 
    task="esg", 
    log_level="INFO"
)
print("version:", ws.version)
print("project_dir:", ws.project_dir)

Fetch the labeled dataset from the Label Studio server

from ekorpkit.io.fetch.labelstudio import LabelStudio

ls = LabelStudio()

time: 644 ms (started: 2022-12-16 04:04:13 +00:00)

project_list = ls.list_projects(verbose=True)

ESG Topic Classification (Sep 2022)
ESG Polarity Classification (Sep 2022)
ESG Topic Classification
ESG Polarity Classification
time: 1.12 s (started: 2022-12-16 04:04:20 +00:00)

ls.name = "esg_polarity_labels"
esg_polarity_data = ls.export_annotations(project_id=12)
print(esg_polarity_data.shape)
esg_polarity_data.head()

INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/12/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/configs/esg_polarity_labels(2)_config.yaml

(1158, 6)

	id	text	annot_id	annotator	origin	labels
0	97756	세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ...	20328	3	prediction	Neutral
1	97667	美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열...	20327	3	prediction	Neutral
2	97650	[머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이...	20326	3	prediction	Neutral
3	97627	영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공...	20325	3	prediction	Neutral
4	97593	경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹...	20324	3	prediction	Neutral

time: 6.1 s (started: 2022-12-16 04:04:22 +00:00)
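The log lines above show which endpoint `export_annotations` calls: `/api/projects/<id>/export` with `exportType=JSON`. As a rough sketch of what that request looks like outside ekorpkit, the snippet below builds the same URL with the standard library and does not send it. The host name comes from the log. The token value is a hypothetical placeholder, and the `Authorization: Token ...` header scheme is an assumption about the Label Studio server configuration rather than something shown in this notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Host copied from the log output; the token is a hypothetical placeholder.
BASE_URL = "http://ekorpkit-labelstudio:8080"
API_TOKEN = "<your-label-studio-token>"


def build_export_request(project_id: int, export_type: str = "JSON") -> Request:
    """Build (but do not send) a GET request for a project's annotation export."""
    query = urlencode({"exportType": export_type})
    url = f"{BASE_URL}/api/projects/{project_id}/export?{query}"
    # Assumption: the server accepts token authentication via this header.
    return Request(url, headers={"Authorization": f"Token {API_TOKEN}"})


request = build_export_request(12)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the download; ekorpkit additionally saves the result as a parquet file and a config file, as the log shows.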

ls.name = "esg_topic_labels"
esg_topic_data = ls.export_annotations(project_id=13)
print(esg_topic_data.shape)
esg_topic_data.head()

INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/13/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/configs/esg_topic_labels(2)_config.yaml

(576, 6)

	id	text	annot_id	annotator	origin	labels
0	201831	지역>충남 \| 경제>금융_재테크 \| 지역>울산 [머니투데이 구경민 기자] 미래에셋...	20337	1	prediction	S-사회공헌
1	201600	패밀리 오피스 분야 전락적 MOU 이창헌(앞줄 오른쪽) 한국M&A거래소 회장과 류...	20336	1	prediction	UNKNOWN
2	201319	01%)를 미래에셋증권과 이음프라이빗에쿼티(PE) 컨소시엄에 4500억원에 매각하기...	20335	1	prediction	UNKNOWN
3	200829	다이렉트 IRP 운용·자산관리 수수료 전부 면제를 통한 비용부담 해소 은행 및 보...	20334	1	prediction	UNKNOWN
4	199893	1999년 12월 자본금 500억원에 설립된 미래에셋증권은 약 20년 만에 200배...	20333	1	prediction	G-지배구조

time: 3.42 s (started: 2022-12-16 04:04:28 +00:00)

Snorkel LabelModel

from ekorpkit.tasks.label.snorkel import BaseSnorkel

snorkel = BaseSnorkel(name="esg_polarity_snorkel")

INFO:ekorpkit.config:Init batch - Batch name: esg_polarity_snorkel, Batch num: 2
INFO:ekorpkit.config:Init batch - Batch name: esg_polarity_snorkel, Batch num: 2

time: 640 ms (started: 2022-12-14 11:40:17 +00:00)

data_file = "/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet"

snorkel.load_datasets(data_files=data_file, test_split_ratio=0.2, seed=12345)

INFO:ekorpkit.io.file:Processing [1] files from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet
INFO:ekorpkit.datasets.config:Splitting the dataframe into train and test with ratio 0.2
INFO:ekorpkit.datasets.config:Train data: (926, 7), Test data: (232, 7)
INFO:ekorpkit.datasets.config:Shuffling the dataframe with seed 12345
INFO:ekorpkit.datasets.config:Train data: (926, 7)
INFO:ekorpkit.datasets.config:Test data: (232, 7)
INFO:ekorpkit.io.file:Concatenating 2 dataframes

time: 23.6 ms (started: 2022-12-14 11:40:18 +00:00)
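The log reports 926 training rows and 232 test rows out of 1158 for `test_split_ratio=0.2`. A minimal sketch of a seeded split that yields the same sizes is given below. It assumes the test size is rounded up, which is consistent with the logged numbers but is not confirmed by the ekorpkit source, and the function name `split_indices` is made up for illustration. The shuffle will not reproduce the exact rows ekorpkit selects.

```python
import math
import random


def split_indices(n_rows, test_ratio=0.2, seed=12345):
    """Shuffle row indices with a fixed seed and cut off the test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test size is rounded up (1158 * 0.2 = 231.6 -> 232).
    n_test = math.ceil(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1158)
print(len(train_idx), len(test_idx))  # 926 232
```

Fixing the seed makes the split repeatable, so later runs evaluate on the same held-out rows.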

Writing Labeling Functions

Each crowdworker can be treated as a single labeling function: a worker labels only a subset of the data points, and those labels may contain errors or conflict with the labels from other workers (that is, from other labeling functions). A worker's labeling function simply returns the label that worker submitted for a given text, and abstains if the worker did not label it.
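The idea can be sketched without Snorkel. In the toy example below, each worker's function looks up the item in that worker's own votes and returns `-1` (Snorkel's conventional abstain value) when there is none. The annotations, worker names, and the `Annotation` tuple are made up for illustration.

```python
from collections import namedtuple

ABSTAIN = -1  # Snorkel's conventional "no vote" value

Annotation = namedtuple("Annotation", ["id", "annotator", "label"])

# Toy annotations, invented for this example.
annotations = [
    Annotation(id=1, annotator="a", label=2),
    Annotation(id=2, annotator="a", label=1),
    Annotation(id=2, annotator="b", label=0),
    Annotation(id=3, annotator="b", label=1),
]


def make_worker_lf(worker_id, rows):
    """Return a function that replays one worker's votes and abstains otherwise."""
    votes = {row.id: row.label for row in rows if row.annotator == worker_id}

    def worker_lf(item_id):
        return votes.get(item_id, ABSTAIN)

    return worker_lf


workers = ("a", "b")
worker_lfs = {w: make_worker_lf(w, annotations) for w in workers}

# One row per item, one column per worker.
label_matrix = [[worker_lfs[w](item_id) for w in workers] for item_id in (1, 2, 3)]
print(label_matrix)  # [[2, -1], [1, 0], [-1, 1]]
```

Item 2 shows a conflict (the two workers disagree), while items 1 and 3 each have a single vote. A label model takes a matrix of this shape as its input.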

Crowdworker labeling functions

eKonf.viewsource(snorkel.compose_worker_lfs)

    def compose_worker_lfs(self):
        labels_by_annotator = self.data.groupby(self.columns.annotator)
        worker_dicts = {}
        for worker_id in labels_by_annotator.groups:
            worker_df = labels_by_annotator.get_group(worker_id)
            worker_dicts[worker_id] = dict(zip(worker_df.id, worker_df.classes))

        log.info(f"Number of workers: {len(worker_dicts)}")

        def worker_lf(x, worker_dict):
            return worker_dict.get(x.id, self.ABSTAIN)

        def make_worker_lf(worker_id):
            worker_dict = worker_dicts[worker_id]
            name = f"worker_{worker_id}"
            return LabelingFunction(
                name, f=worker_lf, resources={"worker_dict": worker_dict}
            )

        worker_lfs = [make_worker_lf(worker_id) for worker_id in worker_dicts]
        self.__worker_lfs__ = worker_lfs
        return worker_lfs

time: 1.34 ms (started: 2022-12-13 10:56:35 +00:00)

snorkel.data

INFO:ekorpkit.io.file:Concatenating 2 dataframes

	id	text	annot_id	annotator	origin	labels	classes
0	99621	31일 김동관 한화솔루션 전략부문 대표는 P4G 기본세션 에너지부문 '더 푸르른 지...	19868	3	prediction	Neutral	1
1	90918	김 변호사는 2001년 LG화학이 화학분야 중간지주회사인 LG CI를 인적 분할할 ...	18929	2	prediction	Neutral	1
2	108056	[머니투데이 김성은 기자] 한화솔루션의 그린에너지 사업부문인 한화큐셀이 글로벌 기...	19926	3	prediction	Positive	2
3	90740	2차전지 분야에서도 LG화학(현재는 LG에너지솔루션으로 분사)의 법인세 부담률은 ...	18927	2	prediction	Neutral	1
4	93872	김동관, P4G 정상회의 기조연설에서 기업 역할 강조 [아시아경제 황윤주 기자] ...	19823	3	prediction	Positive	2
...	...	...	...	...	...	...	...
1153	102985	MIT 등 주요 10여개 대학 석·박사 및 학부생 대상 신학철 LG화학(05191...	19106	2	prediction	Neutral	1
1154	108198	18일 금융투자 업계에 따르면 LG화학은 17일 오후 4시 주주와 투자자를 대상으로...	19198	2	prediction	Neutral	1
1155	89454	LG화학은 오는 16일까지 중국 선전(深圳)에서 열리는 '차이나플라스 2021'에...	18904	2	prediction	Positive	2
1156	128683	[아시아경제 황윤주 기자] LG화학의 유럽 폴란드 공장이 지속가능경영의 모범사례로...	19628	2	prediction	Positive	2
1157	109789	한화솔루션은 오는 7월부터 2년 간 총 48t(톤)의 수소를 공급한다 또 이후 충...	19939	3	prediction	Positive	2

1158 rows × 7 columns

time: 8.95 ms (started: 2022-12-14 11:41:05 +00:00)

worker_lfs = snorkel.compose_worker_lfs()

INFO:ekorpkit.tasks.label.snorkel:Number of workers: 3

time: 3.54 ms (started: 2022-12-13 10:56:38 +00:00)

snorkel.apply_worker_lfs(worker_lfs)

INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to train data
100%|██████████| 926/926 [00:00<00:00, 34148.60it/s]
INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to test data
100%|██████████| 232/232 [00:00<00:00, 33436.83it/s]

time: 40.2 ms (started: 2022-12-13 10:56:39 +00:00)

summary = snorkel.lf_summary()
summary

INFO:ekorpkit.tasks.label.snorkel:Training set coverage:  100.0%

	j	Polarity	Coverage	Overlaps	Conflicts	Correct	Incorrect	Emp. Acc.
worker_1	0	[0, 1, 2]	0.012959	0.0	0.0	12	0	1.0
worker_2	1	[0, 1, 2]	0.588553	0.0	0.0	545	0	1.0
worker_3	2	[0, 1, 2]	0.398488	0.0	0.0	369	0	1.0

time: 13.8 ms (started: 2022-12-13 10:56:40 +00:00)
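The Coverage, Overlaps, and Conflicts columns can be recomputed by hand from a label matrix. The sketch below uses the usual definitions: coverage is the fraction of rows a function votes on, overlaps is the fraction where at least one other function also votes, and conflicts is the fraction where another function votes differently. The toy matrix reuses the per-worker row counts from the summary (12, 545, and 369 out of 926) with exactly one vote per row. The vote values themselves are invented, and `lf_stats` is a hypothetical helper, not part of ekorpkit or Snorkel.

```python
ABSTAIN = -1


def lf_stats(label_matrix, j):
    """Coverage, overlap and conflict fractions for labeling function j."""
    n = len(label_matrix)
    covered = overlaps = conflicts = 0
    for row in label_matrix:
        vote = row[j]
        if vote == ABSTAIN:
            continue
        covered += 1
        # Votes cast on the same row by the other labeling functions.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != vote for v in others):
            conflicts += 1
    return {
        "coverage": covered / n,
        "overlaps": overlaps / n,
        "conflicts": conflicts / n,
    }


# Toy matrix: every row is labeled by exactly one of three workers.
label_matrix = (
    [[1, ABSTAIN, ABSTAIN]] * 12
    + [[ABSTAIN, 1, ABSTAIN]] * 545
    + [[ABSTAIN, ABSTAIN, 1]] * 369
)

for j in range(3):
    stats = lf_stats(label_matrix, j)
    print(j, round(stats["coverage"], 6), stats["overlaps"], stats["conflicts"])
```

The coverage values round to 0.012959, 0.588553, and 0.398488, matching the table. Overlaps and conflicts are zero because no text was labeled by more than one worker.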

Train LabelModel And Generate Probabilistic Labels

snorkel.fit()

100%|██████████| 100/100 [00:01<00:00, 86.08epoch/s]

time: 1.19 s (started: 2022-12-13 10:56:43 +00:00)

snorkel.eval()

LabelModel Accuracy for train: 1.000
LabelModel Accuracy for test: 1.000
time: 5.52 ms (started: 2022-12-13 10:56:45 +00:00)
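An accuracy of 1.000 is less informative than it looks. The summary reported zero overlaps, so each text carries exactly one worker vote, and the labels used for evaluation appear to be those same votes. One plausible reading is that any aggregation then simply returns the lone vote. The sketch below illustrates this with a plain majority vote, used here as a simpler stand-in for Snorkel's LabelModel. The rows and gold labels are invented.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN if nobody voted."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    # Ties are broken by first occurrence, which is enough for a sketch.
    return Counter(votes).most_common(1)[0][0]


# Toy rows with exactly one vote each, as in the summary (no overlaps).
rows = [[2, ABSTAIN, ABSTAIN], [ABSTAIN, 1, ABSTAIN], [ABSTAIN, ABSTAIN, 0]]
gold = [2, 1, 0]  # assumed identical to the lone worker vote

preds = [majority_vote(r) for r in rows]
accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
print(preds, accuracy)  # [2, 1, 0] 1.0
```

The label model only has disagreements to resolve once several workers label the same texts, and only then does the accuracy figure measure anything.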

preds = snorkel.predict()

100%|██████████| 1158/1158 [00:00<00:00, 37554.83it/s]
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/configs/esg_polarity_snorkel(1)_config.yaml

time: 581 ms (started: 2022-12-13 10:56:48 +00:00)

snorkel.save_preds(preds, columns=["id", "text", "labels"])
eKonf.load_data("esg_polarity_snorkel(0)_preds.parquet", snorkel.batch_dir)

INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet

	id	text	labels
0	97756	세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ...	Neutral
1	97667	美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열...	Neutral
2	97650	[머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이...	Neutral
3	97627	영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공...	Neutral
4	97593	경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹...	Neutral
...	...	...	...
1153	88550	산업통상자원부는 현대글로비스와 LG화학의 전기차 택시에 대한 배터리 대여사업을 승인...	Neutral
1154	88439	[헤럴드경제 정세희 기자]신학철 LG화학 부회장은 “동시대를 앞서나가는 디지털 전...	Positive
1155	88413	갈등의 골이 깊어지면서 '시계제로' 상태에 빠진 아시아나항공 인수전이 무산될 가능성...	Negative
1156	88407	그간 경남도와 창원시는 주력산업인 제조업의 성장이 둔화하면서 어려운 시기를 겪었으나...	Positive
1157	88382	이와 함께 지난 7일 한샘은 삼성전자와 전략적 사업협력 업무협약(MOU)을 체결한...	Neutral

1158 rows × 3 columns

time: 140 ms (started: 2022-12-13 10:56:54 +00:00)

Build a dataset using the data generated by the label model

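# Directory holding the label-model output; the path is inferred from the loader logs shown after this cell
data_dir = "../data/esg"
cfg = eKonf.compose("dataset=dataset_build")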
cfg.name = "esg_polarity_kr"
cfg.data_dir = data_dir
cfg.data_file = "esg_polarity_snorkel_data.parquet"
cfg.force.build = True
cfg.pipeline.split_sampling.stratify_on = "labels"
cfg.pipeline.split_sampling.random_state = 123
cfg.pipeline.split_sampling.test_size = 0.2
cfg.pipeline.split_sampling.dev_size = 0.2
cfg.pipeline.reset_index.drop_index = True
cfg.verbose = False
esg_polarity_ds = eKonf.instantiate(cfg)
esg_polarity_ds.persist()

INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('load_dataframe', 'load_dataframe'), ('reset_index', 'reset_index'), ('split_sampling', 'split_sampling')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function load_dataframe at 0x7f553803bf70>)
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_snorkel_data.parquet
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7f553803b1f0>)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function split_sampling at 0x7f5538032dc0>)
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.datasets.base:Dataset esg_polarity_kr built with 13616 rows
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'labels'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [dev]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [test]
INFO:ekorpkit.base:Using batcher with minibatch size: 38
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 38  procs: 230  input_split: False  merge_output: True  len(data): 8713 len(args): 5

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.276040
INFO:ekorpkit.base:Using batcher with minibatch size: 10
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 10  procs: 230  input_split: False  merge_output: True  len(data): 2179 len(args): 5

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.250046
INFO:ekorpkit.base:Using batcher with minibatch size: 12
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 12  procs: 230  input_split: False  merge_output: True  len(data): 2724 len(args): 5

INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.253649
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
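
The split sizes in the logs (8713 train, 2179 dev, 2724 test out of 13616 rows) are consistent with taking `test_size=0.2` from the full dataset first and then `dev_size=0.2` from what remains, rounding up each time. The sketch below checks that arithmetic. The order of operations and the rounding rule are assumptions inferred from the logged numbers, not taken from the ekorpkit source, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.2, dev_size=0.2):
    """Sizes of a test split followed by a dev split of the remainder."""
    # Assumption: fractional sizes are rounded up.
    n_test = math.ceil(n_rows * test_size)
    remaining = n_rows - n_test
    # Assumption: dev_size is applied to the rows left after the test split.
    n_dev = math.ceil(remaining * dev_size)
    return {"train": remaining - n_dev, "dev": n_dev, "test": n_test}


print(split_sizes(13616))  # {'train': 8713, 'dev': 2179, 'test': 2724}
```

Under this reading the dev set is 16% of the full dataset rather than 20%, which is worth keeping in mind when comparing dev and test metrics.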