Preparing training datasets#

from ekorpkit import eKonf

if eKonf.is_colab():
    eKonf.mount_google_drive()
ws = eKonf.set_workspace(
    workspace="/workspace", 
    project="ekorpkit-book/exmaples", 
    task="esg", 
    log_level="INFO"
)
print("version:", ws.version)
print("project_dir:", ws.project_dir)
INFO:ekorpkit.base:Set environment variable EKORPKIT_DATA_ROOT=/workspace/data
INFO:ekorpkit.base:Set environment variable CACHED_PATH_CACHE_ROOT=/workspace/.cache/cached_path
version: 0.1.40.post0.dev57
project_dir: /workspace/projects/ekorpkit-book/exmaples
time: 955 ms (started: 2022-12-16 04:03:41 +00:00)

Fetch the labeled dataset from the labelstudio server#

from ekorpkit.io.fetch.labelstudio import LabelStudio

ls = LabelStudio()
time: 644 ms (started: 2022-12-16 04:04:13 +00:00)
project_list = ls.list_projects(verbose=True)
13: ESG Topic Classification (Sep 2022)
12: ESG Polarity Classification (Sep 2022)
3: ESG Topic Classification
2: ESG Polarity Classification
time: 1.12 s (started: 2022-12-16 04:04:20 +00:00)
ls.name = "esg_polarity_labels"
esg_polarity_data = ls.export_annotations(project_id=12)
print(esg_polarity_data.shape)
esg_polarity_data.head()
INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/12/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/esg_polarity_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_polarity_labels/configs/esg_polarity_labels(2)_config.yaml
(1158, 6)
id text annot_id annotator origin labels
0 97756 세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ... 20328 3 prediction Neutral
1 97667 美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열... 20327 3 prediction Neutral
2 97650 [머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이... 20326 3 prediction Neutral
3 97627 영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공... 20325 3 prediction Neutral
4 97593 경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹... 20324 3 prediction Neutral
time: 6.1 s (started: 2022-12-16 04:04:22 +00:00)
ls.name = "esg_topic_labels"
esg_topic_data = ls.export_annotations(project_id=13)
print(esg_topic_data.shape)
esg_topic_data.head()
INFO:ekorpkit.io.fetch.labelstudio:fetching http://ekorpkit-labelstudio:8080/api/projects/13/export with {'exportType': 'JSON'} to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json
INFO:ekorpkit.io.fetch.labelstudio:/workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_annotations.json is downloaded
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/esg_topic_labels(2)_export.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/outputs/esg_topic_labels/configs/esg_topic_labels(2)_config.yaml
(576, 6)
id text annot_id annotator origin labels
0 201831 지역>충남 | 경제>금융_재테크 | 지역>울산 [머니투데이 구경민 기자] 미래에셋... 20337 1 prediction S-사회공헌
1 201600 패밀리 오피스 분야 전락적 MOU 이창헌(앞줄 오른쪽) 한국M&A거래소 회장과 류... 20336 1 prediction UNKNOWN
2 201319 01%)를 미래에셋증권과 이음프라이빗에쿼티(PE) 컨소시엄에 4500억원에 매각하기... 20335 1 prediction UNKNOWN
3 200829 다이렉트 IRP 운용·자산관리 수수료 전부 면제를 통한 비용부담 해소 은행 및 보... 20334 1 prediction UNKNOWN
4 199893 1999년 12월 자본금 500억원에 설립된 미래에셋증권은 약 20년 만에 200배... 20333 1 prediction G-지배구조
time: 3.42 s (started: 2022-12-16 04:04:28 +00:00)
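The exported frames above have one row per annotation with the columns id, text, annot_id, annotator, origin and labels. As a rough illustration of how a nested JSON export could be flattened into that shape, here is a minimal sketch. The sample structure (tasks with `data`, `annotations`, `result`, `value.choices`) is an assumption about the export format, and `flatten_annotations` is a hypothetical helper, not the ekorpkit implementation.

```python
# Hypothetical sample shaped like a Label Studio JSON export.
# The nesting used here is an assumption, not data copied from the server.
sample_export = [
    {
        "id": 97756,
        "data": {"text": "example sentence A"},
        "annotations": [
            {
                "id": 20328,
                "completed_by": 3,
                "result": [
                    {"origin": "prediction", "value": {"choices": ["Neutral"]}}
                ],
            }
        ],
    },
    {
        "id": 97667,
        "data": {"text": "example sentence B"},
        "annotations": [],  # a task without annotations yields no rows
    },
]


def flatten_annotations(tasks):
    """Return one flat record per labeled result in the nested export."""
    rows = []
    for task in tasks:
        for annot in task.get("annotations", []):
            for result in annot.get("result", []):
                choices = result.get("value", {}).get("choices", [])
                if not choices:
                    continue  # skip results that carry no class choice
                rows.append(
                    {
                        "id": task["id"],
                        "text": task["data"]["text"],
                        "annot_id": annot["id"],
                        "annotator": annot["completed_by"],
                        "origin": result.get("origin"),
                        "labels": choices[0],
                    }
                )
    return rows


rows = flatten_annotations(sample_export)
print(rows)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)` to get a table with the same column names as the output above.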

Snorkel LabelModel#

from ekorpkit.tasks.label.snorkel import BaseSnorkel

snorkel = BaseSnorkel(name="esg_polarity_snorkel")
INFO:ekorpkit.config:Init batch - Batch name: esg_polarity_snorkel, Batch num: 2
time: 640 ms (started: 2022-12-14 11:40:17 +00:00)
data_file = "/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet"

snorkel.load_datasets(data_files=data_file, test_split_ratio=0.2, seed=12345)
INFO:ekorpkit.io.file:Processing [1] files from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_labels/esg_polarity_labels(1)_export.parquet
INFO:ekorpkit.datasets.config:Splitting the dataframe into train and test with ratio 0.2
INFO:ekorpkit.datasets.config:Train data: (926, 7), Test data: (232, 7)
INFO:ekorpkit.datasets.config:Shuffling the dataframe with seed 12345
INFO:ekorpkit.datasets.config:Train data: (926, 7)
INFO:ekorpkit.datasets.config:Test data: (232, 7)
INFO:ekorpkit.io.file:Concatenating 2 dataframes
time: 23.6 ms (started: 2022-12-14 11:40:18 +00:00)
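The log above reports 926 training rows and 232 test rows out of 1158 for `test_split_ratio=0.2`. The sketch below reproduces those sizes with a seeded shuffle and a rounded-up test share. `split_indices` is a hypothetical helper, and both the rounding rule and the shuffle are assumptions chosen to match the logged sizes; the actual ekorpkit split may differ in detail and will not select the same rows.

```python
import math
import random


def split_indices(n_rows, test_ratio, seed):
    """Shuffle row indices with a fixed seed and carve off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test size is rounded up, which matches 232 of 1158.
    n_test = math.ceil(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1158, 0.2, seed=12345)
print(len(train_idx), len(test_idx))  # 926 232
```

Fixing the seed makes the split repeatable, so later evaluation numbers refer to the same held-out rows.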

Writing Labeling Functions#

Each crowdworker can be treated as a single labeling function: each worker labels only a subset of the data points and may make errors or disagree with other workers. A worker's labeling function simply returns the label that worker submitted for a given text, and abstains if the worker did not label it.
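To make the idea concrete without any Snorkel dependency, the following sketch builds one lookup-based function per worker from a toy annotation list. The records, the `ABSTAIN = -1` value and the helper names are illustrative assumptions, not part of ekorpkit.

```python
ABSTAIN = -1  # assumed sentinel for "this worker did not label the item"

# Toy annotations: (item id, annotator id, class index). Purely illustrative.
annotations = [
    {"id": 10, "annotator": "a", "classes": 1},
    {"id": 11, "annotator": "a", "classes": 2},
    {"id": 11, "annotator": "b", "classes": 0},
    {"id": 12, "annotator": "b", "classes": 1},
]


def build_worker_dicts(records):
    """Group submitted labels by worker: {worker: {item id: class}}."""
    worker_dicts = {}
    for record in records:
        worker_dicts.setdefault(record["annotator"], {})[record["id"]] = record[
            "classes"
        ]
    return worker_dicts


def make_worker_lf(worker_dict):
    """Return a function that looks up this worker's label or abstains."""

    def worker_lf(item_id):
        return worker_dict.get(item_id, ABSTAIN)

    return worker_lf


worker_dicts = build_worker_dicts(annotations)
worker_lfs = {name: make_worker_lf(d) for name, d in sorted(worker_dicts.items())}

# One row per item, one column per worker.
item_ids = [10, 11, 12]
label_matrix = [[lf(item_id) for lf in worker_lfs.values()] for item_id in item_ids]
print(label_matrix)  # [[1, -1], [2, 0], [-1, 1]]
```

In this toy matrix, item 11 has two conflicting votes, which is the kind of disagreement a label model is meant to resolve.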

Crowdworker labeling functions#

eKonf.viewsource(snorkel.compose_worker_lfs)
    def compose_worker_lfs(self):
        labels_by_annotator = self.data.groupby(self.columns.annotator)
        worker_dicts = {}
        for worker_id in labels_by_annotator.groups:
            worker_df = labels_by_annotator.get_group(worker_id)
            worker_dicts[worker_id] = dict(zip(worker_df.id, worker_df.classes))

        log.info(f"Number of workers: {len(worker_dicts)}")

        def worker_lf(x, worker_dict):
            return worker_dict.get(x.id, self.ABSTAIN)

        def make_worker_lf(worker_id):
            worker_dict = worker_dicts[worker_id]
            name = f"worker_{worker_id}"
            return LabelingFunction(
                name, f=worker_lf, resources={"worker_dict": worker_dict}
            )

        worker_lfs = [make_worker_lf(worker_id) for worker_id in worker_dicts]
        self.__worker_lfs__ = worker_lfs
        return worker_lfs

time: 1.34 ms (started: 2022-12-13 10:56:35 +00:00)
snorkel.data
INFO:ekorpkit.io.file:Concatenating 2 dataframes
id text annot_id annotator origin labels classes
0 99621 31일 김동관 한화솔루션 전략부문 대표는 P4G 기본세션 에너지부문 '더 푸르른 지... 19868 3 prediction Neutral 1
1 90918 김 변호사는 2001년 LG화학이 화학분야 중간지주회사인 LG CI를 인적 분할할 ... 18929 2 prediction Neutral 1
2 108056 [머니투데이 김성은 기자] 한화솔루션의 그린에너지 사업부문인 한화큐셀이 글로벌 기... 19926 3 prediction Positive 2
3 90740 2차전지 분야에서도 LG화학(현재는 LG에너지솔루션으로 분사)의 법인세 부담률은 ... 18927 2 prediction Neutral 1
4 93872 김동관, P4G 정상회의 기조연설에서 기업 역할 강조 [아시아경제 황윤주 기자] ... 19823 3 prediction Positive 2
... ... ... ... ... ... ... ...
1153 102985 MIT 등 주요 10여개 대학 석·박사 및 학부생 대상 신학철 LG화학(05191... 19106 2 prediction Neutral 1
1154 108198 18일 금융투자 업계에 따르면 LG화학은 17일 오후 4시 주주와 투자자를 대상으로... 19198 2 prediction Neutral 1
1155 89454 LG화학은 오는 16일까지 중국 선전(深圳)에서 열리는 '차이나플라스 2021'에... 18904 2 prediction Positive 2
1156 128683 [아시아경제 황윤주 기자] LG화학의 유럽 폴란드 공장이 지속가능경영의 모범사례로... 19628 2 prediction Positive 2
1157 109789 한화솔루션은 오는 7월부터 2년 간 총 48t(톤)의 수소를 공급한다 또 이후 충... 19939 3 prediction Positive 2

1158 rows × 7 columns

time: 8.95 ms (started: 2022-12-14 11:41:05 +00:00)
worker_lfs = snorkel.compose_worker_lfs()
INFO:ekorpkit.tasks.label.snorkel:Number of workers: 3
time: 3.54 ms (started: 2022-12-13 10:56:38 +00:00)
snorkel.apply_worker_lfs(worker_lfs)
INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to train data
100%|██████████| 926/926 [00:00<00:00, 34148.60it/s]
INFO:ekorpkit.tasks.label.snorkel:Applying worker lfs to test data
100%|██████████| 232/232 [00:00<00:00, 33436.83it/s]
time: 40.2 ms (started: 2022-12-13 10:56:39 +00:00)

summary = snorkel.lf_summary()
summary
INFO:ekorpkit.tasks.label.snorkel:Training set coverage:  100.0%
          j   Polarity  Coverage  Overlaps  Conflicts  Correct  Incorrect  Emp. Acc.
worker_1  0  [0, 1, 2]  0.012959       0.0        0.0       12          0        1.0
worker_2  1  [0, 1, 2]  0.588553       0.0        0.0      545          0        1.0
worker_3  2  [0, 1, 2]  0.398488       0.0        0.0      369          0        1.0
time: 13.8 ms (started: 2022-12-13 10:56:40 +00:00)
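The Coverage column in the summary can be checked by hand: it is the fraction of training rows on which a labeling function did not abstain. The sketch below rebuilds a label matrix from the per-worker counts in the Correct column (12, 545, 369) and recomputes coverage and overlaps. The placeholder class value and the helper names are assumptions for illustration only.

```python
ABSTAIN = -1  # assumed abstain value

# Votes per worker, taken from the "Correct" column of the summary.
votes_per_worker = [12, 545, 369]

# Build a matrix in which each row carries exactly one worker's vote.
label_matrix = []
for worker_index, n_votes in enumerate(votes_per_worker):
    for _ in range(n_votes):
        row = [ABSTAIN] * len(votes_per_worker)
        row[worker_index] = 1  # placeholder class; the value does not matter here
        label_matrix.append(row)


def coverage(matrix, col):
    """Fraction of rows where labeling function `col` voted."""
    return sum(row[col] != ABSTAIN for row in matrix) / len(matrix)


def overlaps(matrix, col):
    """Fraction of rows where `col` voted together with at least one other LF."""
    hits = sum(
        1
        for row in matrix
        if row[col] != ABSTAIN and sum(v != ABSTAIN for v in row) > 1
    )
    return hits / len(matrix)


coverages = [round(coverage(label_matrix, j), 6) for j in range(3)]
overlap_rates = [overlaps(label_matrix, j) for j in range(3)]
print(len(label_matrix), coverages, overlap_rates)
```

The three coverages sum to 1 and every overlap rate is 0, which matches the reported 100% training coverage with each text labeled by exactly one worker.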

Train LabelModel And Generate Probabilistic Labels#

snorkel.fit()
100%|██████████| 100/100 [00:01<00:00, 86.08epoch/s]
time: 1.19 s (started: 2022-12-13 10:56:43 +00:00)

snorkel.eval()
LabelModel Accuracy for train: 1.000
LabelModel Accuracy for test: 1.000
time: 5.52 ms (started: 2022-12-13 10:56:45 +00:00)
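For intuition about what vote aggregation does, here is a majority-vote baseline in plain Python. This is a simpler stand-in, not Snorkel's LabelModel, and the `ABSTAIN = -1` value and function name are assumptions. Because the summary above reports zero overlaps, every row carries a single vote, so an aggregator has nothing to reconcile; that may be why the reported accuracy is 1.000 on both splits.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain value


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the two leading classes are tied
    return ranked[0][0]


print(majority_vote([ABSTAIN, 2, ABSTAIN]))  # 2: a single vote is returned as is
print(majority_vote([1, 1, 2]))              # 1: clear majority
print(majority_vote([0, 2, ABSTAIN]))        # -1: tie between 0 and 2
```

A trained label model becomes more informative than this baseline once several workers label the same text and disagree.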
preds = snorkel.predict()
100%|██████████| 1158/1158 [00:00<00:00, 37554.83it/s]
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.config:Saving config to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/configs/esg_polarity_snorkel(1)_config.yaml
time: 581 ms (started: 2022-12-13 10:56:48 +00:00)
snorkel.save_preds(preds, columns=["id", "text", "labels"])
eKonf.load_data("esg_polarity_snorkel(0)_preds.parquet", snorkel.batch_dir)
INFO:ekorpkit.io.file:Saving dataframe to /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(1)_preds.parquet
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit-book/exmaples/esg/data/outputs/esg_polarity_snorkel/esg_polarity_snorkel(0)_preds.parquet
id text labels
0 97756 세계 백신 생산의 60%를 담당한 인도가 자국 코로나19 확산 탓에 기능을 못하는 ... Neutral
1 97667 美서 한미 백신기업 파트너십 행사 개최 22일(현지시간) 미국 워싱턴 DC에서 열... Neutral
2 97650 [머니투데이 워싱턴=공동취재단 , 서울=이소은 기자] 삼성바이오로직스가 22일(이... Neutral
3 97627 영국표준협회 ISO22301 취득 2018년 1·2공장 국내 첫 인증후 최근 3공... Neutral
4 97593 경구용 치료제 2만 명분 이미 확보 식품의약품안전처가 경구용 코로나19 치료제(먹... Neutral
... ... ... ...
1153 88550 산업통상자원부는 현대글로비스와 LG화학의 전기차 택시에 대한 배터리 대여사업을 승인... Neutral
1154 88439 [헤럴드경제 정세희 기자]신학철 LG화학 부회장은 “동시대를 앞서나가는 디지털 전... Positive
1155 88413 갈등의 골이 깊어지면서 '시계제로' 상태에 빠진 아시아나항공 인수전이 무산될 가능성... Negative
1156 88407 그간 경남도와 창원시는 주력산업인 제조업의 성장이 둔화하면서 어려운 시기를 겪었으나... Positive
1157 88382 이와 함께 지난 7일 한샘은 삼성전자와 전략적 사업협력 업무협약(MOU)을 체결한... Neutral

1158 rows × 3 columns

time: 140 ms (started: 2022-12-13 10:56:54 +00:00)

Build a dataset using the data generated by the label model#

cfg = eKonf.compose("dataset=dataset_build")
cfg.name = "esg_polarity_kr"
cfg.data_dir = "../data/esg"  # directory taken from the load paths in the log below
cfg.data_file = "esg_polarity_snorkel_data.parquet"
cfg.force.build = True
cfg.pipeline.split_sampling.stratify_on = "labels"
cfg.pipeline.split_sampling.random_state = 123
cfg.pipeline.split_sampling.test_size = 0.2
cfg.pipeline.split_sampling.dev_size = 0.2
cfg.pipeline.reset_index.drop_index = True
cfg.verbose = False
esg_polarity_ds = eKonf.instantiate(cfg)
esg_polarity_ds.persist()
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('load_dataframe', 'load_dataframe'), ('reset_index', 'reset_index'), ('split_sampling', 'split_sampling')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function load_dataframe at 0x7f553803bf70>)
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_snorkel_data.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_snorkel_data.parquet
INFO:ekorpkit.base:Applying pipe: functools.partial(<function reset_index at 0x7f553803b1f0>)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function split_sampling at 0x7f5538032dc0>)
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.datasets.base:Dataset esg_polarity_kr built with 13616 rows
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.info.column:index: index, index of data: None, columns: ['id', 'text', 'labels'], id: ['id']
INFO:ekorpkit.info.column:Adding id [split] to ['id']
INFO:ekorpkit.info.column:Added id [split], now ['id', 'split']
INFO:ekorpkit.info.column:Added a column [split] with value [train]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [dev]
INFO:ekorpkit.io.file:Processing [1] files from ['esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
INFO:ekorpkit.info.column:Added a column [split] with value [test]
INFO:ekorpkit.base:Using batcher with minibatch size: 38
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 38  procs: 230  input_split: False  merge_output: True  len(data): 8713 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.276040
INFO:ekorpkit.base:Using batcher with minibatch size: 10
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 10  procs: 230  input_split: False  merge_output: True  len(data): 2179 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.250046
INFO:ekorpkit.base:Using batcher with minibatch size: 12
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 12  procs: 230  input_split: False  merge_output: True  len(data): 2724 len(args): 5
INFO:ekorpkit.info.stat: >> elapsed time to calculate statistics: 0:00:00.253649
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-train.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-dev.parquet
INFO:ekorpkit.io.file:Saving dataframe to ../data/esg/esg_polarity_kr/esg_polarity_kr-test.parquet
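The log reports 13616 rows in total and batches of 8713, 2179 and 2724 rows for the three splits. Those numbers are consistent with taking the test share first and then the dev share from the remainder, each rounded up. The sketch below shows that arithmetic; the order of the two cuts, the rounding rule and the mapping of sizes to train, dev and test are assumptions inferred from the logged sizes, and stratification on labels is not modeled.

```python
import math


def three_way_sizes(n_rows, test_size, dev_size):
    """Compute (train, dev, test) sizes for a two-stage split."""
    # Stage 1: hold out the test share of the full dataset.
    n_test = math.ceil(n_rows * test_size)
    remainder = n_rows - n_test
    # Stage 2: hold out the dev share of what is left.
    n_dev = math.ceil(remainder * dev_size)
    n_train = remainder - n_dev
    return n_train, n_dev, n_test


sizes = three_way_sizes(13616, test_size=0.2, dev_size=0.2)
print(sizes)  # (8713, 2179, 2724)
```

Under this reading, `dev_size = 0.2` applies to the rows left after the test cut, so dev ends up at about 16% of the full dataset rather than 20%.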