Predicting Sentiments of the FOMC Corpus

Analyse the FOMC corpus with the Loughran and McDonald (LM) dictionary, FinBERT, and a T5 model.

%config InlineBackend.figure_format='retina'
import logging

from ekorpkit import eKonf

logging.basicConfig(level=logging.WARNING)
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())
version: 0.1.33+28.g90d1dea
is notebook? True
is colab? False
environment variables:
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'NUM_WORKERS': 230}
start_year = 2000
data_dir = "../data/fomc"
eKonf.env().FRED_API_KEY
pydantic.types.SecretStr

Predict sentiments with the LM sentiment analyser

Load FOMC Corpus

fomc_sents = eKonf.load_data("fomc_sents.parquet", data_dir)
fomc_sents.tail()
id text split timestamp content_type date speaker title decision rate recent_meeting recent_decision recent_rate next_meeting next_decision next_rate text_num_words section_id sent_id
653463 2854 It will not have the word “somewhat” on line 3. train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 2014-12-17 0.0 0.25 2015-01-28 0.0 0.25 10 287 2
653464 2854 Chair Yellen Yes Vice Chairman Dudley ... train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 2014-12-17 0.0 0.25 2015-01-28 0.0 0.25 31 287 4
653465 2854 And let me confirm that the next meeting will ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 2014-12-17 0.0 0.25 2015-01-28 0.0 0.25 19 288 3
653466 2854 I believe box lunches are now available for pe... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 2014-12-17 0.0 0.25 2015-01-28 0.0 0.25 33 288 4
653467 2854 I will do my best, and I will consider at the ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 2014-12-17 0.0 0.25 2015-01-28 0.0 0.25 18 288 5

Predict sentiments of sentences

model_cfg = eKonf.compose('model/sentiment=lm')
model_cfg.num_workers = 100
lmsa = eKonf.instantiate(model_cfg)
article = fomc_sents.text[10]
lmsa.predict_sentence(article)
{'num_tokens': 156,
 'polarity': -0.9999990000010001,
 'polarity_label': 'negative',
 'uncertainty': 1e-06}
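The values above are consistent with a count-based score: polarity = (positive - negative) / (positive + negative + eps) with eps = 1e-6, and uncertainty = share of uncertainty words plus eps. The following is a minimal sketch under that assumption. The word sets, the `score_sentence` name, and the 0.1 label threshold are hypothetical stand-ins, not ekorpkit's implementation or the actual LM dictionary.

```python
import re

# Tiny hypothetical word lists standing in for the Loughran-McDonald dictionary.
POSITIVE = {"improve", "strong", "gain"}
NEGATIVE = {"decline", "weak", "loss"}
UNCERTAIN = {"may", "could", "uncertain"}
EPS = 1e-6  # assumed smoothing constant, inferred from the outputs shown above


def score_sentence(text, threshold=0.1):
    """Score one sentence by counting dictionary hits (illustrative only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    unc = sum(t in UNCERTAIN for t in tokens)
    # Net tone, normalised by the number of sentiment-bearing words.
    polarity = (pos - neg) / (pos + neg + EPS)
    if polarity > threshold:  # threshold is an assumption
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return {
        "num_tokens": len(tokens),
        "polarity": polarity,
        "polarity_label": label,
        "uncertainty": unc / max(len(tokens), 1) + EPS,
    }


result = score_sentence("Growth may decline next year.")
print(result)
```

With one negative word and no positive ones, the polarity is -1 / (1 + 1e-6), roughly -0.999999, which is the same magnitude as the value returned by `predict_sentence` above.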
fomc_sent_sentiments = lmsa.predict(fomc_sents)
eKonf.save_data(fomc_sent_sentiments, "fomc_sent_sentiments.parquet", data_dir)
INFO:ekorpkit.models.sentiment.lbsa:Predicting sentiments of the column [text] using predict_sentence
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 100  input_split: False  merge_output: True  len(data): 653468 len(args): 5
Predicting [text]: 100%|██████████| 654/654 [02:50<00:00,  3.83it/s]
INFO:ekorpkit.models.sentiment.lbsa: >> elapsed time to predict: 0:02:51.902053
id text split timestamp content_type date speaker title decision rate ... next_meeting next_decision next_rate text_num_words section_id sent_id num_tokens polarity polarity_label uncertainty
0 0 The Secretary reported that advices of the ele... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 1993-02-18 0.0 3.00 47 29 0 52 0.000000 neutral 0.000001
1 0 By unanimous vote, the Committee elected the f... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 1993-02-18 0.0 3.00 73 35 0 78 -1.000000 negative 0.000001
2 0 By unanimous vote, William J. McDonough, Marga... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 1993-02-18 0.0 3.00 74 37 0 83 1.000000 positive 0.000001
3 0 On January 15, 1993, the continuing rules, reg... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 1993-02-18 0.0 3.00 59 39 0 68 -0.333333 negative 0.014707
4 0 Members were asked to indicate if they wished ... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 1993-02-18 0.0 3.00 25 39 1 26 -0.999999 negative 0.000001
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
653463 2854 It will not have the word “somewhat” on line 3. train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 ... 2015-01-28 0.0 0.25 10 287 2 13 0.000000 neutral 0.076924
653464 2854 Chair Yellen Yes Vice Chairman Dudley ... train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 ... 2015-01-28 0.0 0.25 31 287 4 31 0.000000 neutral 0.000001
653465 2854 And let me confirm that the next meeting will ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 2015-01-28 0.0 0.25 19 288 3 21 0.000000 neutral 0.000001
653466 2854 I believe box lunches are now available for pe... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 2015-01-28 0.0 0.25 33 288 4 39 0.000000 neutral 0.025642
653467 2854 I will do my best, and I will consider at the ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 2015-01-28 0.0 0.25 18 288 5 22 0.000000 neutral 0.000001

653468 rows × 23 columns
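The log above shows the 653,468 sentences being processed in minibatches of 1,000 across 100 processes, which is why the progress bar counts 654 steps. A minimal sketch of that batching pattern follows. The log reports a joblib backend; this sketch swaps in the standard library's `concurrent.futures` instead, and `make_minibatches`, `predict_batch`, and `predict_all` are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def make_minibatches(items, minibatch_size):
    """Split a list into consecutive chunks of at most minibatch_size items."""
    return [items[i:i + minibatch_size] for i in range(0, len(items), minibatch_size)]


def predict_batch(batch):
    # Placeholder scorer that counts words; a real run would call the sentiment model.
    return [len(text.split()) for text in batch]


def predict_all(items, minibatch_size=1000, workers=4):
    batches = make_minibatches(items, minibatch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves batch order, so results line up with the inputs.
        results = pool.map(predict_batch, batches)
    return [score for batch in results for score in batch]


# 653,468 sentences in minibatches of 1,000 give the 654 steps on the progress bar.
print(math.ceil(653468 / 1000))  # → 654

scores = predict_all(["a b c", "d e"] * 5, minibatch_size=3)
```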

fomc_sent_sentiments = eKonf.load_data("fomc_sent_sentiments.parquet", data_dir)
fomc_sent_sentiments

Aggregate sentiment scores

fomc_tones_lm = lmsa.aggregate_scores(fomc_sent_sentiments, groupby=['content_type', 'date'])
fomc_tones_lm.content_type = fomc_tones_lm.content_type.str.replace('fomc_', '')
eKonf.save_data(fomc_tones_lm, 'fomc_tones_lm.parquet', data_dir)
fomc_tones_lm = eKonf.load_data('fomc_tones_lm.parquet', data_dir)
fomc_tones_lm
      content_type        date  polarity_mean  polarity_diffusion  positive  negative  num_tokens_sum  num_tokens_mean  num_tokens_median  num_examples  polarity_mean_label  polarity_diffusion_label
   0     beigebook  2021-01-13      -0.068247           -0.071053      3972      5229          397855        22.489119               21.0         17691              neutral                   neutral
   1     beigebook  2021-03-03      -0.032683           -0.037384      3887      4518          381300        22.590201               21.0         16879              neutral                   neutral
   2     beigebook  2021-04-14      -0.030535           -0.035085      3568      4100          340121        22.430983               21.0         15163              neutral                   neutral
   3     beigebook  2021-06-02      -0.039873           -0.044760      3837      4561          357745        22.117156               21.0         16175              neutral                   neutral
   4     beigebook  2021-07-14       0.011743            0.012195       191       182           15803        21.413279               21.0           738              neutral                   neutral
 ...           ...         ...            ...                 ...       ...       ...             ...              ...                ...           ...                  ...                       ...
2390     testimony  2021-05-19      -0.171642           -0.164179        15        26            2590        38.656716               29.0            67             negative                  negative
2391     testimony  2021-06-22       0.220126            0.207547        20         9            1456        27.471698               24.0            53             positive                  positive
2392     testimony  2021-07-14       0.259259            0.305556        14         3            1066        29.611111               27.0            36             positive                  positive
2393     testimony  2021-09-28       0.031746            0.000000         9         9            1026        24.428571               22.0            42              neutral                   neutral
2394     testimony  2021-11-30      -0.120000           -0.200000         3         7             556        27.800000               26.0            20              neutral                  negative

2395 rows × 12 columns
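The table above suggests how the group-level scores are built: `polarity_diffusion` matches (positive - negative) / num_examples (for example, (3 - 7) / 20 = -0.2 in the last row), while `polarity_mean` is assumed here to be the plain average of sentence polarities. The sketch below illustrates that reading with plain Python; `aggregate_sketch` is a hypothetical name and not the library's `aggregate_scores`.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sketch(rows, groupby=("content_type", "date")):
    """Group sentence-level scores and compute mean and diffusion per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in groupby)].append(row)

    summary = []
    for key, members in sorted(groups.items()):
        n = len(members)
        positive = sum(r["polarity_label"] == "positive" for r in members)
        negative = sum(r["polarity_label"] == "negative" for r in members)
        record = dict(zip(groupby, key))
        record.update(
            polarity_mean=mean(r["polarity"] for r in members),  # assumed definition
            polarity_diffusion=(positive - negative) / n,
            positive=positive,
            negative=negative,
            num_examples=n,
        )
        summary.append(record)
    return summary


def make_rows(label, polarity, count):
    base = {"content_type": "testimony", "date": "2021-11-30"}
    return [{**base, "polarity": polarity, "polarity_label": label} for _ in range(count)]


# Made-up sentences that reproduce only the label counts of the last table row:
# 3 positive, 7 negative and 10 neutral sentences out of 20.
rows = make_rows("positive", 1.0, 3) + make_rows("negative", -1.0, 7) + make_rows("neutral", 0.0, 10)
summary = aggregate_sketch(rows)
print(summary[0]["polarity_diffusion"])  # → -0.2
```

Only the label counts are taken from the table; the sentence polarities are invented, so the resulting `polarity_mean` does not match the table's -0.12.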

cfg = eKonf.compose('pipeline/pivot')
cfg.index = 'date'
cfg.columns = 'content_type'
cfg.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
tone_data_lm = eKonf.pipe(fomc_tones_lm, cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)
tone_data_lm = eKonf.load_data('fomc_tone_data_lm.parquet', data_dir)
tone_data_lm
polarity_mean_beigebook polarity_mean_meeting_script polarity_mean_minutes polarity_mean_press_conf polarity_mean_speech polarity_mean_statement polarity_mean_testimony polarity_diffusion_beigebook polarity_diffusion_meeting_script polarity_diffusion_minutes ... num_tokens_sum_speech num_tokens_sum_statement num_tokens_sum_testimony num_tokens_mean_beigebook num_tokens_mean_meeting_script num_tokens_mean_minutes num_tokens_mean_press_conf num_tokens_mean_speech num_tokens_mean_statement num_tokens_mean_testimony
date
1990-02-07 NaN -0.087583 NaN NaN NaN NaN NaN NaN -0.095663 NaN ... NaN NaN NaN NaN 30.213010 NaN NaN NaN NaN NaN
1990-03-27 NaN -0.171992 NaN NaN NaN NaN NaN NaN -0.179702 NaN ... NaN NaN NaN NaN 29.846369 NaN NaN NaN NaN NaN
1990-05-15 NaN -0.116052 NaN NaN NaN NaN NaN NaN -0.125461 NaN ... NaN NaN NaN NaN 29.749077 NaN NaN NaN NaN NaN
1990-07-03 NaN -0.114829 NaN NaN NaN NaN NaN NaN -0.117794 NaN ... NaN NaN NaN NaN 29.667920 NaN NaN NaN NaN NaN
1990-08-21 NaN -0.209552 NaN NaN NaN NaN NaN NaN -0.219403 NaN ... NaN NaN NaN NaN 31.032836 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-11-30 NaN NaN NaN NaN -0.167014 NaN -0.12 NaN NaN NaN ... 3066.0 NaN 556.0 NaN NaN NaN NaN 31.937500 NaN 27.8
2021-12-01 -0.046022 NaN NaN NaN NaN NaN NaN -0.048109 NaN NaN ... NaN NaN NaN 22.539497 NaN NaN NaN NaN NaN NaN
2021-12-02 NaN NaN NaN NaN -0.077381 NaN NaN NaN NaN NaN ... 6514.0 NaN NaN NaN NaN NaN NaN 36.188889 NaN NaN
2021-12-15 NaN NaN -0.043929 -0.075441 NaN 0.166667 NaN NaN NaN -0.064286 ... NaN 489.0 NaN NaN NaN 30.521429 37.587413 NaN 27.166667 NaN
2021-12-17 NaN NaN NaN NaN -0.356613 NaN NaN NaN NaN NaN ... 3694.0 NaN NaN NaN NaN NaN NaN 29.317460 NaN NaN

1876 rows × 35 columns
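The pivot step turns one row per (content_type, date) into one row per date, with a column per value and content type such as `polarity_mean_minutes`. A minimal pure-Python sketch of that reshaping is shown below; `pivot_wide` is a hypothetical helper, and missing combinations are filled with `None` where pandas shows NaN.

```python
def pivot_wide(rows, index, columns, values):
    """Turn long records into one record per index value with '<value>_<column>' keys."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[index], {index: row[index]})
        for value in values:
            record[f"{value}_{row[columns]}"] = row[value]
    # Fill combinations that never occur with None (pandas would show NaN).
    all_keys = sorted({key for record in wide.values() for key in record if key != index})
    return [
        {index: idx, **{key: wide[idx].get(key) for key in all_keys}}
        for idx in sorted(wide)
    ]


# Three values taken from the last rows of the table above.
long_rows = [
    {"date": "2021-12-15", "content_type": "minutes", "polarity_mean": -0.043929},
    {"date": "2021-12-15", "content_type": "statement", "polarity_mean": 0.166667},
    {"date": "2021-12-17", "content_type": "speech", "polarity_mean": -0.356613},
]
wide_rows = pivot_wide(long_rows, "date", "content_type", ["polarity_mean"])
for record in wide_rows:
    print(record)
```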

Predict sentiments and aggregate scores with a pipeline

model_cfg = eKonf.compose('model/sentiment=lm')

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['predict', 'aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sents.parquet"
cfg.data.data_dir = data_dir
cfg.predict.model = model_cfg
cfg.predict.path.output.base_dir = data_dir
cfg.predict.path.output.filename = "fomc_sent_sentiments.parquet"
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_tone_data_lm.parquet'
tone_data_lm = eKonf.instantiate(cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)
INFO:ekorpkit.io.file:Processing [1] files from ['fomc_sents.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/fomc/fomc_sents.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/fomc/fomc_sents.parquet
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('predict', 'predict'), ('aggregate_scores', 'aggregate_scores'), ('replace', 'replace'), ('pivot', 'pivot'), ('save_dataframe', 'save_dataframe')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f0abf86c820>)
INFO:ekorpkit.preprocessors.tokenizer:instantiating ekorpkit.preprocessors.stopwords.Stopwords...
INFO:ekorpkit.base:Calling load_candidates
INFO:ekorpkit.io.file:Processing [1] files from ['/workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet
INFO:ekorpkit.models.ngram.ngram:loaded 58142 candidates
INFO:ekorpkit.models.sentiment.lbsa:Predicting sentiments of the column [text] using predict_sentence
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 100  input_split: False  merge_output: True  len(data): 653468 len(args): 5
Predicting [text]: 100%|██████████| 654/654 [03:06<00:00,  3.51it/s]
INFO:ekorpkit.models.sentiment.lbsa: >> elapsed time to predict: 0:03:07.279688
INFO:ekorpkit.pipelines.pipe:Saving data to: {'file': None, 'filename': 'fomc_sent_sentiments.parquet', 'base_dir': '../data/fomc', 'filetype': '', 'columns': None, 'suffix': None, 'filepath': '../data/fomc/fomc_sent_sentiments.parquet'}
INFO:ekorpkit.io.file:Saving dataframe to ../data/fomc/fomc_sent_sentiments.parquet
INFO:ekorpkit.base:Applying pipe: functools.partial(<function aggregate_scores at 0x7f0abf86c8b0>)
INFO:ekorpkit.base:instantiating ekorpkit.models.sentiment.base.BaseSentimentAnalyser...
INFO:ekorpkit.pipelines.pipe:filename not specified
INFO:ekorpkit.base:Applying pipe: functools.partial(<function general_function at 0x7f0abf86cc10>)
INFO:ekorpkit.pipelines.pipe:processing column: content_type
INFO:ekorpkit.pipelines.pipe: >> elapsed time to replace: 0:00:00.003644
INFO:ekorpkit.base:Applying pipe: functools.partial(<function pivot at 0x7f0abf86c1f0>)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_dataframe at 0x7f0abf7043a0>)
INFO:ekorpkit.io.file:Saving dataframe to ../data/fomc/fomc_sentiment_data.parquet
polarity_mean_beigebook polarity_mean_meeting_script polarity_mean_minutes polarity_mean_press_conf polarity_mean_speech polarity_mean_statement polarity_mean_testimony polarity_diffusion_beigebook polarity_diffusion_meeting_script polarity_diffusion_minutes ... num_tokens_sum_speech num_tokens_sum_statement num_tokens_sum_testimony num_tokens_mean_beigebook num_tokens_mean_meeting_script num_tokens_mean_minutes num_tokens_mean_press_conf num_tokens_mean_speech num_tokens_mean_statement num_tokens_mean_testimony
recent_meeting
1990-02-07 NaN -0.087583 NaN NaN NaN NaN NaN NaN -0.095663 NaN ... NaN NaN NaN NaN 30.213010 NaN NaN NaN NaN NaN
1990-03-27 NaN -0.171992 NaN NaN NaN NaN NaN NaN -0.179702 NaN ... NaN NaN NaN NaN 29.846369 NaN NaN NaN NaN NaN
1990-05-15 NaN -0.116052 NaN NaN NaN NaN NaN NaN -0.125461 NaN ... NaN NaN NaN NaN 29.749077 NaN NaN NaN NaN NaN
1990-07-03 NaN -0.114829 NaN NaN NaN NaN NaN NaN -0.117794 NaN ... NaN NaN NaN NaN 29.667920 NaN NaN NaN NaN NaN
1990-08-21 NaN -0.209552 NaN NaN NaN NaN NaN NaN -0.219403 NaN ... NaN NaN NaN NaN 31.032836 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-06-16 0.011743 NaN 0.041638 -0.017544 -0.031786 0.435897 0.235955 0.012195 NaN 0.031142 ... 6894.0 384.0 2522.0 21.413279 NaN 30.615917 27.100000 29.088608 29.538462 28.337079
2021-07-28 -0.120547 NaN -0.043969 0.021318 -0.042941 0.461538 NaN -0.134921 NaN -0.069079 ... 13170.0 399.0 NaN 20.980159 NaN 32.888158 27.261628 29.072848 30.692308 NaN
2021-09-22 -0.074328 NaN -0.079199 -0.087292 -0.133837 0.476190 0.031746 -0.075712 NaN -0.112403 ... 31138.0 419.0 1026.0 22.957808 NaN 31.348837 25.809384 29.431002 29.928571 24.428571
2021-11-03 -0.046022 NaN -0.064255 -0.089881 -0.030345 0.215686 -0.120000 -0.048109 NaN -0.080851 ... 26096.0 538.0 556.0 22.539497 NaN 31.880851 27.720238 31.980392 31.647059 27.800000
2021-12-15 NaN NaN -0.043929 -0.075441 -0.356613 0.166667 NaN NaN NaN -0.064286 ... 3694.0 489.0 NaN NaN NaN 30.521429 37.587413 29.317460 27.166667 NaN

286 rows × 35 columns

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_tone_data_lm.parquet'
tone_data_lm = eKonf.instantiate(cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)
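The `_pipeline_` list names the steps to run in order, each step receiving the previous step's output. The idea can be sketched as a small step runner; this is an illustration of the pattern only, not how ekorpkit resolves or executes its pipes, and `run_pipeline`, `strip_prefix`, and `keep_nonneutral` are hypothetical names.

```python
def run_pipeline(data, step_names, registry):
    """Apply the named steps to data in order, feeding each output to the next step."""
    for name in step_names:
        if name not in registry:
            raise KeyError(f"unknown pipeline step: {name}")
        data = registry[name](data)
    return data


# Hypothetical stand-in steps operating on a list of dicts.
def strip_prefix(rows):
    return [{**r, "content_type": r["content_type"].replace("fomc_", "")} for r in rows]


def keep_nonneutral(rows):
    return [r for r in rows if r["polarity_label"] != "neutral"]


registry = {"replace": strip_prefix, "filter": keep_nonneutral}
rows = [
    {"content_type": "fomc_minutes", "polarity_label": "negative"},
    {"content_type": "fomc_speech", "polarity_label": "neutral"},
]
result = run_pipeline(rows, ["replace", "filter"], registry)
print(result)  # → [{'content_type': 'minutes', 'polarity_label': 'negative'}]
```

Dropping `predict` from the step list, as in the second configuration above, corresponds to starting the runner from data that already carries sentence-level scores.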

Predict sentiments with FinBERT

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds_cfg.verbose = False

overrides=[
    '+model/transformer=classification',
    '+model/transformer/pretrained=finbert',
]
model_cfg = eKonf.compose('model/transformer=classification', overrides)
model_cfg.name = 'fomc_finbert'
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 32
model_cfg.config.eval_batch_size = 32
model_cfg._method_ = ['eval']
# model_cfg._method_ = ['train', 'eval']
finbert_model = eKonf.instantiate(model_cfg)
Accuracy:  0.8960176991150443
Precison:  0.8957698961367982
Recall:  0.8960176991150443
F1 Score:  0.8958088377661515
Model Report: 
___________________________________________________
              precision    recall  f1-score   support

    negative       0.81      0.77      0.79        61
     neutral       0.96      0.96      0.96       277
    positive       0.79      0.81      0.80       114

    accuracy                           0.90       452
   macro avg       0.85      0.85      0.85       452
weighted avg       0.90      0.90      0.90       452
../../../_images/b2e9a96506f9e457c17cf16198c49396b93e29bbc2fd153bdd62aa9c587f3a2e.png
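The `macro avg` and `weighted avg` rows of the report above can be recomputed from the per-class rows: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The sketch below uses the two-decimal values printed in the report, so the results are approximate; the function names are illustrative.

```python
# Per-class scores and supports copied from the FinBERT report above.
report = {
    "negative": {"precision": 0.81, "recall": 0.77, "f1": 0.79, "support": 61},
    "neutral": {"precision": 0.96, "recall": 0.96, "f1": 0.96, "support": 277},
    "positive": {"precision": 0.79, "recall": 0.81, "f1": 0.80, "support": 114},
}


def macro_avg(report, metric):
    """Unweighted mean of a metric over classes."""
    return sum(c[metric] for c in report.values()) / len(report)


def weighted_avg(report, metric):
    """Mean of a metric over classes, weighted by class support."""
    total = sum(c["support"] for c in report.values())
    return sum(c[metric] * c["support"] for c in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(report, metric), 2), round(weighted_avg(report, metric), 2))
```

Because the neutral class holds 277 of the 452 examples and scores highest, the weighted averages (about 0.90) sit above the macro averages (about 0.85).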
model_cfg._method_ = []
cfg = eKonf.compose(config_group='pipeline')
cfg.name = 'fomc_sent_sentiments'
cfg.data_dir = data_dir
cfg.data_file = "fomc_sents.parquet"
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = data_dir
cfg.predict.output_file = f'{cfg.name}_finbert.parquet'
fomc_sent_sentiments_finbert = eKonf.instantiate(cfg)
fomc_sent_sentiments_finbert.head()
id text split timestamp content_type date speaker title decision rate ... recent_rate next_meeting next_decision next_rate text_num_words section_id sent_id pred_labels raw_preds pred_probs
0 0 The Secretary reported that advices of the ele... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 3.0 1993-02-18 0.0 3.0 47 29 0 neutral [2.2915966510772705, -0.9586986899375916, -2.4... 0.955035
1 0 By unanimous vote, the Committee elected the f... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 3.0 1993-02-18 0.0 3.0 73 35 0 neutral [1.692587971687317, -0.2049560397863388, -2.53... 0.858684
2 0 By unanimous vote, William J. McDonough, Marga... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 3.0 1993-02-18 0.0 3.0 74 37 0 neutral [1.8892569541931152, -0.3972317576408386, -2.6... 0.898655
3 0 On January 15, 1993, the continuing rules, reg... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 3.0 1993-02-18 0.0 3.0 59 39 0 neutral [2.335063934326172, -1.0927255153656006, -2.45... 0.960805
4 0 Members were asked to indicate if they wished ... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 3.0 1993-02-18 0.0 3.0 25 39 1 neutral [2.3842966556549072, -1.327033519744873, -2.36... 0.967973

5 rows × 22 columns

fomc_sent_sentiments_finbert = eKonf.load_data('fomc_sent_sentiments_finbert.parquet', data_dir)
fomc_sent_sentiments_finbert
id text split timestamp content_type date speaker title decision rate ... recent_rate next_meeting next_decision next_rate text_num_words section_id sent_id pred_labels raw_preds pred_probs
0 0 The Secretary reported that advices of the ele... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 3.00 1993-02-18 0.0 3.00 47 29 0 neutral [2.2915966510772705, -0.9586986899375916, -2.4... 0.955035
1 0 By unanimous vote, the Committee elected the f... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 3.00 1993-02-18 0.0 3.00 73 35 0 neutral [1.692587971687317, -0.2049560397863388, -2.53... 0.858684
2 0 By unanimous vote, William J. McDonough, Marga... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 3.00 1993-02-18 0.0 3.00 74 37 0 neutral [1.8892569541931152, -0.3972317576408386, -2.6... 0.898655
3 0 On January 15, 1993, the continuing rules, reg... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 3.00 1993-02-18 0.0 3.00 59 39 0 neutral [2.335063934326172, -1.0927255153656006, -2.45... 0.960805
4 0 Members were asked to indicate if they wished ... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.00 ... 3.00 1993-02-18 0.0 3.00 25 39 1 neutral [2.3842966556549072, -1.327033519744873, -2.36... 0.967973
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
653463 2854 It will not have the word “somewhat” on line 3. train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 ... 0.25 2015-01-28 0.0 0.25 10 287 2 neutral [2.3733553886413574, -1.3893874883651733, -2.2... 0.967620
653464 2854 Chair Yellen Yes Vice Chairman Dudley ... train 2014-12-17 fomc_meeting_script 2014-12-17 MR. LUECKE FOMC Meeting Transcript 0.0 0.25 ... 0.25 2015-01-28 0.0 0.25 31 287 4 neutral [2.224238872528076, -0.893900990486145, -2.421... 0.948905
653465 2854 And let me confirm that the next meeting will ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 0.25 2015-01-28 0.0 0.25 19 288 3 neutral [2.349479913711548, -1.391977071762085, -2.298... 0.967770
653466 2854 I believe box lunches are now available for pe... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 0.25 2015-01-28 0.0 0.25 33 288 4 neutral [2.311434030532837, -1.1415460109710693, -2.37... 0.960746
653467 2854 I will do my best, and I will consider at the ... train 2014-12-17 fomc_meeting_script 2014-12-17 CHAIR YELLEN FOMC Meeting Transcript 0.0 0.25 ... 0.25 2015-01-28 0.0 0.25 18 288 5 neutral [2.3932976722717285, -1.2607381343841553, -2.4... 0.967106

653468 rows × 22 columns
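One reading of the columns above is that `raw_preds` holds the classifier's logits and `pred_probs` the largest softmax probability, which is roughly what the visible values give. The sketch below shows that conversion under this assumption. The label order in `LABELS` is a guess (only that the first position corresponds to neutral is suggested by the rows shown), and the logits are made up.

```python
import math

LABELS = ["neutral", "positive", "negative"]  # assumed order, not confirmed by the source


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits of roughly the same size as those in the table.
label, prob = top_prediction([2.3, -1.0, -2.5])
print(label, round(prob, 3))
```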

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments_finbert.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.aggregate_scores._method_ = 'classification'
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_sentiment_finbert_next.parquet'
tone_data_finbert = eKonf.instantiate(cfg)
tone_data_finbert = eKonf.to_datetime(tone_data_finbert, _columns='date')
tone_data_finbert = tone_data_finbert.set_index('date')
eKonf.save_data(tone_data_finbert, 'fomc_tone_data_finbert.parquet', data_dir)
tone_data_finbert = eKonf.load_data('fomc_tone_data_finbert.parquet', data_dir)

cols = [
    'polarity_mean_minutes', 'polarity_mean_press_conf', 'polarity_mean_speech', 'polarity_mean_statement',
    'polarity_diffusion_minutes', 'polarity_diffusion_press_conf', 'polarity_diffusion_speech', 'polarity_diffusion_statement',
]

tone_data_finbert = tone_data_finbert[cols].copy()
tone_data_finbert.columns = tone_data_finbert.columns.str.replace('polarity', 'finbert')
tone_data_finbert
date        finbert_mean_minutes  finbert_mean_press_conf  finbert_mean_speech  finbert_mean_statement  finbert_diffusion_minutes  finbert_diffusion_press_conf  finbert_diffusion_speech  finbert_diffusion_statement
1990-02-07                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
1990-03-27                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
1990-05-15                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
1990-07-03                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
1990-08-21                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
...                          ...                      ...                  ...                     ...                        ...                           ...                       ...                          ...
2021-11-30                   NaN                      NaN             0.182338                     NaN                        NaN                           NaN                  0.239583                          NaN
2021-12-01                   NaN                      NaN                  NaN                     NaN                        NaN                           NaN                       NaN                          NaN
2021-12-02                   NaN                      NaN             0.262141                     NaN                        NaN                           NaN                  0.338889                          NaN
2021-12-15              0.509806                 0.280516                  NaN                0.412947                      0.675                      0.377622                       NaN                     0.555556
2021-12-17                   NaN                      NaN             0.408242                     NaN                        NaN                           NaN                  0.547619                          NaN

1876 rows × 8 columns

Predict sentiments with T5

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds_cfg.verbose = False

overrides=[
    '+model/transformer=t5_classification_with_simple',
    '+model/transformer/pretrained=t5-base',
]
model_cfg = eKonf.compose('model/transformer=t5_classification_with_simple', overrides)
model_cfg.name = 'fomc_t5'
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 8
model_cfg.config.eval_batch_size = 8
model_cfg._method_ = ['train', 'eval']
# model_cfg._method_ = ['eval']
t5_model = eKonf.instantiate(model_cfg)
/opt/conda/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5.py:164: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:3557: FutureWarning: 
`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.

  warnings.warn(formatted_warning, FutureWarning)
wandb: Currently logged in as: entelecheia. Use `wandb login --relogin` to force relogin
wandb version 0.12.20 is available! To upgrade, please run: $ pip install wandb --upgrade
Tracking run with wandb version 0.12.19
Run data is saved locally in /workspace/projects/ekorpkit-book/outputs/fomc_t5/t5-base/wandb/run-20220701_020155-3fyvrvgf
{'eval_loss': 0.06258307859733529}
Accuracy:  0.9491150442477876
Precison:  0.9495910765330818
Recall:  0.9491150442477876
F1 Score:  0.9474806981643329
Model Report: 
___________________________________________________
              precision    recall  f1-score   support

    negative       0.96      0.75      0.84        61
     neutral       0.97      1.00      0.98       277
    positive       0.90      0.94      0.92       114

    accuracy                           0.95       452
   macro avg       0.94      0.90      0.91       452
weighted avg       0.95      0.95      0.95       452
../../../_images/51d1f0d6180c5d2dcea5cfea86ea9f6030487a17f3441977eb0c59ff3412f185.png
model_cfg._method_ = []
cfg = eKonf.compose(config_group='pipeline')
cfg.name = 'fomc_sent_sentiments'
cfg.data_dir = data_dir
cfg.data_file = "fomc_sents.parquet"
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = data_dir
cfg.predict.output_file = f'{cfg.name}_t5.parquet'
fomc_sent_sentiments_t5 = eKonf.instantiate(cfg)
fomc_sent_sentiments_t5.head()
id text split timestamp content_type date speaker title decision rate ... recent_decision recent_rate next_meeting next_decision next_rate text_num_words section_id sent_id prefix pred_labels
0 0 The Secretary reported that advices of the ele... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 0.0 3.0 1993-02-18 0.0 3.0 47 29 0 classification neutral
1 0 By unanimous vote, the Committee elected the f... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 0.0 3.0 1993-02-18 0.0 3.0 73 35 0 classification neutral
2 0 By unanimous vote, William J. McDonough, Marga... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 0.0 3.0 1993-02-18 0.0 3.0 74 37 0 classification neutral
3 0 On January 15, 1993, the continuing rules, reg... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 0.0 3.0 1993-02-18 0.0 3.0 59 39 0 classification neutral
4 0 Members were asked to indicate if they wished ... train 1993-02-03 fomc_minutes 1993-02-03 Alan Greenspan FOMC Meeting Minutes 0.0 3.0 ... 0.0 3.0 1993-02-18 0.0 3.0 25 39 1 classification neutral

5 rows × 21 columns
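The `prefix` column in the output reflects T5's text-to-text framing: classification is cast as generating the label text for an input tagged with a task prefix. A minimal sketch of that framing follows. The `<prefix>: <text>` layout, the fallback to neutral, and both function names are assumptions for illustration; the example sentence is made up.

```python
LABELS = {"positive", "negative", "neutral"}


def to_t5_input(text, prefix="classification"):
    """Format a sentence as a text-to-text input (assumed '<prefix>: <text>' layout)."""
    return f"{prefix}: {text.strip()}"


def parse_t5_output(generated, fallback="neutral"):
    """Map generated text back to a label, falling back when it is not a known label."""
    candidate = generated.strip().lower()
    return candidate if candidate in LABELS else fallback


model_input = to_t5_input(" Inflation has eased somewhat. ")
print(model_input)  # → classification: Inflation has eased somewhat.
print(parse_t5_output(" Neutral "))  # → neutral
```

Because the model emits label text rather than class probabilities, only label-count statistics such as `polarity_diffusion` can be aggregated from its output.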

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments_t5.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.aggregate_scores._method_ = 'classification_t5'
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_diffusion', 'num_examples']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_sentiment_t5_next.parquet'
tone_data_t5 = eKonf.instantiate(cfg)
tone_data_t5 = eKonf.to_datetime(tone_data_t5, _columns='date')
tone_data_t5 = tone_data_t5.set_index('date')
eKonf.save_data(tone_data_t5, 'fomc_tone_data_t5.parquet', data_dir)
tone_data_t5 = eKonf.load_data('fomc_tone_data_t5.parquet', data_dir)

cols = [
    'polarity_diffusion_minutes', 'polarity_diffusion_press_conf', 'polarity_diffusion_speech', 'polarity_diffusion_statement',
]

tone_data_t5 = tone_data_t5[cols].copy()
tone_data_t5.columns = tone_data_t5.columns.str.replace('polarity', 't5')
tone_data_t5
date        t5_diffusion_minutes  t5_diffusion_press_conf  t5_diffusion_speech  t5_diffusion_statement
1990-02-07                   NaN                      NaN                  NaN                     NaN
1990-03-27                   NaN                      NaN                  NaN                     NaN
1990-05-15                   NaN                      NaN                  NaN                     NaN
1990-07-03                   NaN                      NaN                  NaN                     NaN
1990-08-21                   NaN                      NaN                  NaN                     NaN
...                          ...                      ...                  ...                     ...
2021-11-30                   NaN                      NaN             0.239583                     NaN
2021-12-01                   NaN                      NaN                  NaN                     NaN
2021-12-02                   NaN                      NaN             0.250000                     NaN
2021-12-15              0.403571                 0.216783                  NaN                0.444444
2021-12-17                   NaN                      NaN             0.174603                     NaN

1876 rows × 4 columns