Predicting Sentiments of FOMC Corpus

Analyse FOMC statements with the Loughran and McDonald (LM) dictionary, FinBERT, and a T5 model.

%config InlineBackend.figure_format='retina'
import logging

from ekorpkit import eKonf

logging.basicConfig(level=logging.WARNING)
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())

version: 0.1.33+28.g90d1dea
is notebook? True
is colab? False
environment variables:
{'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'NUM_WORKERS': 230}

start_year = 2000
data_dir = "../data/fomc"
eKonf.env().FRED_API_KEY

pydantic.types.SecretStr

Predict sentiments with the LM sentiment analyser#

Load FOMC Corpus#

fomc_sents = eKonf.load_data("fomc_sents.parquet", data_dir)
fomc_sents.tail()

	id	text	split	timestamp	content_type	date	speaker	title	rate	recent_meeting	recent_rate	next_meeting	next_rate	text_num_words	section_id	sent_id
653463	2854	It will not have the word “somewhat” on line 3.	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.25	2014-12-17	0.25	2015-01-28	0.25	10	287	2
653464	2854	Chair Yellen Yes Vice Chairman Dudley ...	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.25	2014-12-17	0.25	2015-01-28	0.25	31	287	4
653465	2854	And let me confirm that the next meeting will ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.25	2014-12-17	0.25	2015-01-28	0.25	19	288	3
653466	2854	I believe box lunches are now available for pe...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.25	2014-12-17	0.25	2015-01-28	0.25	33	288	4
653467	2854	I will do my best, and I will consider at the ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.25	2014-12-17	0.25	2015-01-28	0.25	18	288	5

Predict sentiments of sentences#

model_cfg = eKonf.compose('model/sentiment=lm')
model_cfg.num_workers = 100
lmsa = eKonf.instantiate(model_cfg)

article = fomc_sents.text[10]
lmsa.predict_sentence(article)

{'num_tokens': 156,
 'polarity': -0.9999990000010001,
 'polarity_label': 'negative',
 'uncertainty': 1e-06}
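
The scores above are consistent with a simple lexicon count: a polarity close to -1 arises when only negative lexicon words are matched in the sentence. The following is a minimal sketch of that idea, not ekorpkit's implementation. The tiny word lists, the `EPS` smoothing constant, the zero threshold for labels, and the function name `lexicon_polarity` are all illustrative assumptions.

```python
import re

# Toy word lists (assumption); the real Loughran-McDonald lexicon is far larger.
POSITIVE = {"gain", "improve", "strong"}
NEGATIVE = {"loss", "decline", "weak"}
EPS = 1e-6  # assumed smoothing term that avoids division by zero


def lexicon_polarity(text):
    """Score a sentence as (pos - neg) / (pos + neg + EPS)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    polarity = (pos - neg) / (pos + neg + EPS)
    if polarity > 0:
        label = "positive"
    elif polarity < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"num_tokens": len(tokens), "polarity": polarity, "polarity_label": label}


result = lexicon_polarity("Strong gain despite weak demand")
print(result)
```

With one negative match and no positive match, this formula gives -1 / (1 + 1e-6), which is the value printed by `predict_sentence` above.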

fomc_sent_sentiments = lmsa.predict(fomc_sents)
eKonf.save_data(fomc_sent_sentiments, "fomc_sent_sentiments.parquet", data_dir)

INFO:ekorpkit.models.sentiment.lbsa:Predicting sentiments of the column [text] using predict_sentence
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 100  input_split: False  merge_output: True  len(data): 653468 len(args): 5
Predicting [text]: 100%|██████████| 654/654 [02:50<00:00,  3.83it/s]
INFO:ekorpkit.models.sentiment.lbsa: >> elapsed time to predict: 0:02:51.902053

	id	text	split	timestamp	content_type	date	speaker	title	decision	rate	...	next_meeting	next_decision	next_rate	text_num_words	section_id	sent_id	num_tokens	polarity	polarity_label	uncertainty
0	0	The Secretary reported that advices of the ele...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	47	29	0	52	0.000000	neutral	0.000001
1	0	By unanimous vote, the Committee elected the f...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	73	35	0	78	-1.000000	negative	0.000001
2	0	By unanimous vote, William J. McDonough, Marga...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	74	37	0	83	1.000000	positive	0.000001
3	0	On January 15, 1993, the continuing rules, reg...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	59	39	0	68	-0.333333	negative	0.014707
4	0	Members were asked to indicate if they wished ...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	25	39	1	26	-0.999999	negative	0.000001
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
653463	2854	It will not have the word “somewhat” on line 3.	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	10	287	2	13	0.000000	neutral	0.076924
653464	2854	Chair Yellen Yes Vice Chairman Dudley ...	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	31	287	4	31	0.000000	neutral	0.000001
653465	2854	And let me confirm that the next meeting will ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	19	288	3	21	0.000000	neutral	0.000001
653466	2854	I believe box lunches are now available for pe...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	33	288	4	39	0.000000	neutral	0.025642
653467	2854	I will do my best, and I will consider at the ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	18	288	5	22	0.000000	neutral	0.000001

653468 rows × 23 columns
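
The log above reports 654 minibatches for 653,468 sentences at a minibatch size of 1,000, which is simply the row count divided by the batch size, rounded up. A rough sketch of that batching pattern follows. It uses the standard library's thread pool rather than joblib, and `minibatches` and `predict_in_batches` are hypothetical helper names, so treat it as an illustration of the idea rather than ekorpkit's batcher.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def minibatches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(items, predict_one, minibatch_size=1000, workers=4):
    """Apply `predict_one` to every item, one minibatch per worker task."""

    def run(batch):
        return [predict_one(item) for item in batch]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input batches.
        batch_results = list(pool.map(run, minibatches(items, minibatch_size)))
    # Flatten the per-batch lists back into one list.
    return [result for batch in batch_results for result in batch]


print(math.ceil(653468 / 1000))  # number of minibatches in the log above
doubled = predict_in_batches(list(range(2500)), lambda x: x * 2)
print(len(doubled))
```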

fomc_sent_sentiments = eKonf.load_data("fomc_sent_sentiments.parquet", data_dir)
fomc_sent_sentiments

	id	text	split	timestamp	content_type	date	speaker	title	decision	rate	...	next_meeting	next_decision	next_rate	text_num_words	section_id	sent_id	num_tokens	polarity	polarity_label	uncertainty
0	0	The Secretary reported that advices of the ele...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	47	29	0	52	0.000000	neutral	0.000001
1	0	By unanimous vote, the Committee elected the f...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	73	35	0	78	-1.000000	negative	0.000001
2	0	By unanimous vote, William J. McDonough, Marga...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	74	37	0	83	1.000000	positive	0.000001
3	0	On January 15, 1993, the continuing rules, reg...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	59	39	0	68	-0.333333	negative	0.014707
4	0	Members were asked to indicate if they wished ...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	1993-02-18	0.0	3.00	25	39	1	26	-0.999999	negative	0.000001
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
653463	2854	It will not have the word “somewhat” on line 3.	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	10	287	2	13	0.000000	neutral	0.076924
653464	2854	Chair Yellen Yes Vice Chairman Dudley ...	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	31	287	4	31	0.000000	neutral	0.000001
653465	2854	And let me confirm that the next meeting will ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	19	288	3	21	0.000000	neutral	0.000001
653466	2854	I believe box lunches are now available for pe...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	33	288	4	39	0.000000	neutral	0.025642
653467	2854	I will do my best, and I will consider at the ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	2015-01-28	0.0	0.25	18	288	5	22	0.000000	neutral	0.000001

653468 rows × 23 columns

Aggregate sentiment scores#

fomc_tones_lm = lmsa.aggregate_scores(fomc_sent_sentiments, groupby=['content_type', 'date'])
fomc_tones_lm.content_type = fomc_tones_lm.content_type.str.replace('fomc_', '')
eKonf.save_data(fomc_tones_lm, 'fomc_tones_lm.parquet', data_dir)

fomc_tones_lm = eKonf.load_data('fomc_tones_lm.parquet', data_dir)
fomc_tones_lm

	content_type	date	polarity_mean	polarity_diffusion	positive	negative	num_tokens_sum	num_tokens_mean	num_tokens_median	num_examples	polarity_mean_label	polarity_diffusion_label
0	beigebook	2021-01-13	-0.068247	-0.071053	3972	5229	397855	22.489119	21.0	17691	neutral	neutral
1	beigebook	2021-03-03	-0.032683	-0.037384	3887	4518	381300	22.590201	21.0	16879	neutral	neutral
2	beigebook	2021-04-14	-0.030535	-0.035085	3568	4100	340121	22.430983	21.0	15163	neutral	neutral
3	beigebook	2021-06-02	-0.039873	-0.044760	3837	4561	357745	22.117156	21.0	16175	neutral	neutral
4	beigebook	2021-07-14	0.011743	0.012195	191	182	15803	21.413279	21.0	738	neutral	neutral
...	...	...	...	...	...	...	...	...	...	...	...	...
2390	testimony	2021-05-19	-0.171642	-0.164179	15	26	2590	38.656716	29.0	67	negative	negative
2391	testimony	2021-06-22	0.220126	0.207547	20	9	1456	27.471698	24.0	53	positive	positive
2392	testimony	2021-07-14	0.259259	0.305556	14	3	1066	29.611111	27.0	36	positive	positive
2393	testimony	2021-09-28	0.031746	0.000000	9	9	1026	24.428571	22.0	42	neutral	neutral
2394	testimony	2021-11-30	-0.120000	-0.200000	3	7	556	27.800000	26.0	20	neutral	negative

2395 rows × 12 columns
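
In the table above, `polarity_diffusion` equals (positive - negative) / num_examples; the last row, for instance, gives (3 - 7) / 20 = -0.2. The sketch below reproduces that aggregation with plain pandas on made-up sentences. The `aggregate` helper and the toy data are assumptions for illustration, and the thresholds behind the `*_label` columns are not reproduced because the source does not state them.

```python
import pandas as pd

# Toy sentence-level predictions (made-up values).
sents = pd.DataFrame({
    "content_type": ["testimony"] * 4 + ["speech"] * 2,
    "date": ["2021-11-30"] * 4 + ["2021-12-02"] * 2,
    "polarity": [-1.0, 0.0, 1.0, -1.0, 0.5, 0.0],
    "polarity_label": ["negative", "neutral", "positive", "negative",
                       "positive", "neutral"],
})


def aggregate(df, groupby):
    """Summarise sentence scores per group with a mean and a diffusion index."""
    out = df.groupby(groupby).agg(
        polarity_mean=("polarity", "mean"),
        positive=("polarity_label", lambda s: (s == "positive").sum()),
        negative=("polarity_label", lambda s: (s == "negative").sum()),
        num_examples=("polarity", "size"),
    ).reset_index()
    # Diffusion index: share of positive sentences minus share of negative ones.
    out["polarity_diffusion"] = (out["positive"] - out["negative"]) / out["num_examples"]
    return out


tones = aggregate(sents, ["content_type", "date"])
print(tones)
```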

cfg = eKonf.compose('pipeline/pivot')
cfg.index = 'date'
cfg.columns = 'content_type'
cfg.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
tone_data_lm = eKonf.pipe(fomc_tones_lm, cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)

tone_data_lm = eKonf.load_data('fomc_tone_data_lm.parquet', data_dir)
tone_data_lm

	polarity_mean_beigebook	polarity_mean_meeting_script	polarity_mean_minutes	polarity_mean_press_conf	polarity_mean_speech	polarity_mean_statement	polarity_mean_testimony	polarity_diffusion_beigebook	polarity_diffusion_meeting_script	polarity_diffusion_minutes	...	num_tokens_sum_speech	num_tokens_sum_statement	num_tokens_sum_testimony	num_tokens_mean_beigebook	num_tokens_mean_meeting_script	num_tokens_mean_minutes	num_tokens_mean_press_conf	num_tokens_mean_speech	num_tokens_mean_statement	num_tokens_mean_testimony
date
1990-02-07	NaN	-0.087583	NaN	NaN	NaN	NaN	NaN	NaN	-0.095663	NaN	...	NaN	NaN	NaN	NaN	30.213010	NaN	NaN	NaN	NaN	NaN
1990-03-27	NaN	-0.171992	NaN	NaN	NaN	NaN	NaN	NaN	-0.179702	NaN	...	NaN	NaN	NaN	NaN	29.846369	NaN	NaN	NaN	NaN	NaN
1990-05-15	NaN	-0.116052	NaN	NaN	NaN	NaN	NaN	NaN	-0.125461	NaN	...	NaN	NaN	NaN	NaN	29.749077	NaN	NaN	NaN	NaN	NaN
1990-07-03	NaN	-0.114829	NaN	NaN	NaN	NaN	NaN	NaN	-0.117794	NaN	...	NaN	NaN	NaN	NaN	29.667920	NaN	NaN	NaN	NaN	NaN
1990-08-21	NaN	-0.209552	NaN	NaN	NaN	NaN	NaN	NaN	-0.219403	NaN	...	NaN	NaN	NaN	NaN	31.032836	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2021-11-30	NaN	NaN	NaN	NaN	-0.167014	NaN	-0.12	NaN	NaN	NaN	...	3066.0	NaN	556.0	NaN	NaN	NaN	NaN	31.937500	NaN	27.8
2021-12-01	-0.046022	NaN	NaN	NaN	NaN	NaN	NaN	-0.048109	NaN	NaN	...	NaN	NaN	NaN	22.539497	NaN	NaN	NaN	NaN	NaN	NaN
2021-12-02	NaN	NaN	NaN	NaN	-0.077381	NaN	NaN	NaN	NaN	NaN	...	6514.0	NaN	NaN	NaN	NaN	NaN	NaN	36.188889	NaN	NaN
2021-12-15	NaN	NaN	-0.043929	-0.075441	NaN	0.166667	NaN	NaN	NaN	-0.064286	...	NaN	489.0	NaN	NaN	NaN	30.521429	37.587413	NaN	27.166667	NaN
2021-12-17	NaN	NaN	NaN	NaN	-0.356613	NaN	NaN	NaN	NaN	NaN	...	3694.0	NaN	NaN	NaN	NaN	NaN	NaN	29.317460	NaN	NaN

1876 rows × 35 columns
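
The wide table above has one column per (value, content type) pair, such as `polarity_mean_minutes`, and `NaN` wherever a document type has no release on that date. A comparable reshape can be sketched with plain pandas. This is an illustration of the shape of the result with made-up numbers, not the code behind `pipeline/pivot`.

```python
import pandas as pd

# Toy long-format tone table (made-up values).
tones = pd.DataFrame({
    "content_type": ["minutes", "statement", "speech"],
    "date": ["2021-12-15", "2021-12-15", "2021-12-17"],
    "polarity_mean": [-0.04, 0.17, -0.36],
    "num_examples": [280, 18, 126],
})

# Pivot to one row per date; columns become a (value, content_type) MultiIndex.
wide = tones.pivot(index="date", columns="content_type",
                   values=["polarity_mean", "num_examples"])
# Flatten the MultiIndex into names such as "polarity_mean_minutes".
wide.columns = [f"{value}_{ctype}" for value, ctype in wide.columns]
wide.index = pd.to_datetime(wide.index)
print(wide)
```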

Predict sentiments and aggregate scores with a pipeline#

model_cfg = eKonf.compose('model/sentiment=lm')

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['predict', 'aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sents.parquet"
cfg.data.data_dir = data_dir
cfg.predict.model = model_cfg
cfg.predict.path.output.base_dir = data_dir
cfg.predict.path.output.filename = "fomc_sent_sentiments.parquet"
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_tone_data_lm.parquet'
tone_data_lm = eKonf.instantiate(cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)

INFO:ekorpkit.io.file:Processing [1] files from ['fomc_sents.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['../data/fomc/fomc_sents.parquet']
INFO:ekorpkit.io.file:Loading data from ../data/fomc/fomc_sents.parquet
INFO:ekorpkit.pipelines.pipe:Applying pipeline: OrderedDict([('predict', 'predict'), ('aggregate_scores', 'aggregate_scores'), ('replace', 'replace'), ('pivot', 'pivot'), ('save_dataframe', 'save_dataframe')])
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f0abf86c820>)
INFO:ekorpkit.preprocessors.tokenizer:instantiating ekorpkit.preprocessors.stopwords.Stopwords...
INFO:ekorpkit.base:Calling load_candidates
INFO:ekorpkit.io.file:Processing [1] files from ['/workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet']
INFO:ekorpkit.io.file:Loading data from /workspace/projects/ekorpkit/ekorpkit/resources/lexicons/LM.parquet
INFO:ekorpkit.models.ngram.ngram:loaded 58142 candidates
INFO:ekorpkit.models.sentiment.lbsa:Predicting sentiments of the column [text] using predict_sentence
INFO:ekorpkit.base:Using batcher with minibatch size: 1000
INFO:ekorpkit.utils.batch.batcher: backend: joblib  minibatch_size: 1000  procs: 100  input_split: False  merge_output: True  len(data): 653468 len(args): 5
Predicting [text]: 100%|██████████| 654/654 [03:06<00:00,  3.51it/s]
INFO:ekorpkit.models.sentiment.lbsa: >> elapsed time to predict: 0:03:07.279688
INFO:ekorpkit.pipelines.pipe:Saving data to: {'file': None, 'filename': 'fomc_sent_sentiments.parquet', 'base_dir': '../data/fomc', 'filetype': '', 'columns': None, 'suffix': None, 'filepath': '../data/fomc/fomc_sent_sentiments.parquet'}
INFO:ekorpkit.io.file:Saving dataframe to ../data/fomc/fomc_sent_sentiments.parquet
INFO:ekorpkit.base:Applying pipe: functools.partial(<function aggregate_scores at 0x7f0abf86c8b0>)
INFO:ekorpkit.base:instantiating ekorpkit.models.sentiment.base.BaseSentimentAnalyser...
INFO:ekorpkit.pipelines.pipe:filename not specified
INFO:ekorpkit.base:Applying pipe: functools.partial(<function general_function at 0x7f0abf86cc10>)
INFO:ekorpkit.pipelines.pipe:processing column: content_type
INFO:ekorpkit.pipelines.pipe: >> elapsed time to replace: 0:00:00.003644
INFO:ekorpkit.base:Applying pipe: functools.partial(<function pivot at 0x7f0abf86c1f0>)
INFO:ekorpkit.base:Applying pipe: functools.partial(<function save_dataframe at 0x7f0abf7043a0>)
INFO:ekorpkit.io.file:Saving dataframe to ../data/fomc/fomc_sentiment_data.parquet

	polarity_mean_beigebook	polarity_mean_meeting_script	polarity_mean_minutes	polarity_mean_press_conf	polarity_mean_speech	polarity_mean_statement	polarity_mean_testimony	polarity_diffusion_beigebook	polarity_diffusion_meeting_script	polarity_diffusion_minutes	...	num_tokens_sum_speech	num_tokens_sum_statement	num_tokens_sum_testimony	num_tokens_mean_beigebook	num_tokens_mean_meeting_script	num_tokens_mean_minutes	num_tokens_mean_press_conf	num_tokens_mean_speech	num_tokens_mean_statement	num_tokens_mean_testimony
recent_meeting
1990-02-07	NaN	-0.087583	NaN	NaN	NaN	NaN	NaN	NaN	-0.095663	NaN	...	NaN	NaN	NaN	NaN	30.213010	NaN	NaN	NaN	NaN	NaN
1990-03-27	NaN	-0.171992	NaN	NaN	NaN	NaN	NaN	NaN	-0.179702	NaN	...	NaN	NaN	NaN	NaN	29.846369	NaN	NaN	NaN	NaN	NaN
1990-05-15	NaN	-0.116052	NaN	NaN	NaN	NaN	NaN	NaN	-0.125461	NaN	...	NaN	NaN	NaN	NaN	29.749077	NaN	NaN	NaN	NaN	NaN
1990-07-03	NaN	-0.114829	NaN	NaN	NaN	NaN	NaN	NaN	-0.117794	NaN	...	NaN	NaN	NaN	NaN	29.667920	NaN	NaN	NaN	NaN	NaN
1990-08-21	NaN	-0.209552	NaN	NaN	NaN	NaN	NaN	NaN	-0.219403	NaN	...	NaN	NaN	NaN	NaN	31.032836	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2021-06-16	0.011743	NaN	0.041638	-0.017544	-0.031786	0.435897	0.235955	0.012195	NaN	0.031142	...	6894.0	384.0	2522.0	21.413279	NaN	30.615917	27.100000	29.088608	29.538462	28.337079
2021-07-28	-0.120547	NaN	-0.043969	0.021318	-0.042941	0.461538	NaN	-0.134921	NaN	-0.069079	...	13170.0	399.0	NaN	20.980159	NaN	32.888158	27.261628	29.072848	30.692308	NaN
2021-09-22	-0.074328	NaN	-0.079199	-0.087292	-0.133837	0.476190	0.031746	-0.075712	NaN	-0.112403	...	31138.0	419.0	1026.0	22.957808	NaN	31.348837	25.809384	29.431002	29.928571	24.428571
2021-11-03	-0.046022	NaN	-0.064255	-0.089881	-0.030345	0.215686	-0.120000	-0.048109	NaN	-0.080851	...	26096.0	538.0	556.0	22.539497	NaN	31.880851	27.720238	31.980392	31.647059	27.800000
2021-12-15	NaN	NaN	-0.043929	-0.075441	-0.356613	0.166667	NaN	NaN	NaN	-0.064286	...	3694.0	489.0	NaN	NaN	NaN	30.521429	37.587413	29.317460	27.166667	NaN

286 rows × 35 columns
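
The log lines of the form `Applying pipe: functools.partial(<function ...>)` suggest that each named step is a function with its configuration bound in advance, applied to the dataframe in order. A minimal sketch of that pattern follows. The helpers `replace`, `pivot`, and `run_pipeline` are hypothetical stand-ins written for this example and are not ekorpkit's functions.

```python
from functools import partial

import pandas as pd


def replace(df, apply_to, to_replace, regex=True):
    """Replace patterns in one column, returning a new dataframe."""
    out = df.copy()
    out[apply_to] = out[apply_to].replace(to_replace, regex=regex)
    return out


def pivot(df, index, columns, values):
    """Pivot to wide format and flatten the column names."""
    wide = df.pivot(index=index, columns=columns, values=values)
    wide.columns = [f"{value}_{name}" for value, name in wide.columns]
    return wide.reset_index()


def run_pipeline(df, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        df = step(df)
    return df


tones = pd.DataFrame({
    "content_type": ["fomc_minutes", "fomc_statement"],
    "date": ["2021-12-15", "2021-12-15"],
    "polarity_mean": [-0.04, 0.17],
})
steps = [
    partial(replace, apply_to="content_type", to_replace={"fomc_": ""}),
    partial(pivot, index="date", columns="content_type", values=["polarity_mean"]),
]
result = run_pipeline(tones, steps)
print(result)
```

Because the sketch resets the index at the end, `date` comes back as an ordinary column, which is why a `set_index('date')` call afterwards makes sense.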

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples', 'num_tokens_sum', 'num_tokens_mean']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_tone_data_lm.parquet'
tone_data_lm = eKonf.instantiate(cfg)
tone_data_lm = eKonf.to_datetime(tone_data_lm, _columns='date')
tone_data_lm = tone_data_lm.set_index('date')
eKonf.save_data(tone_data_lm, 'fomc_tone_data_lm.parquet', data_dir)

Predict sentiments with FinBERT#

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds_cfg.verbose = False

overrides=[
    '+model/transformer=classification',
    '+model/transformer/pretrained=finbert',
]
model_cfg = eKonf.compose('model/transformer=classification', overrides)
model_cfg.name = 'fomc_finbert'
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 32
model_cfg.config.eval_batch_size = 32
model_cfg._method_ = ['eval']
# model_cfg._method_ = ['train', 'eval']
finbert_model = eKonf.instantiate(model_cfg)

Accuracy:  0.8960176991150443
Precison:  0.8957698961367982
Recall:  0.8960176991150443
F1 Score:  0.8958088377661515
Model Report: 
___________________________________________________
              precision    recall  f1-score   support

    negative       0.81      0.77      0.79        61
     neutral       0.96      0.96      0.96       277
    positive       0.79      0.81      0.80       114

    accuracy                           0.90       452
   macro avg       0.85      0.85      0.85       452
weighted avg       0.90      0.90      0.90       452
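
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows: the macro average weights every class equally, while the weighted average weights each class by its support. The sketch below does this by hand with the rounded figures from the report; the helper names are illustrative.

```python
# Per-class figures copied from the report above (already rounded to 2 decimals).
report = {
    "negative": {"precision": 0.81, "recall": 0.77, "support": 61},
    "neutral": {"precision": 0.96, "recall": 0.96, "support": 277},
    "positive": {"precision": 0.79, "recall": 0.81, "support": 114},
}


def macro_avg(report, metric):
    """Unweighted mean over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(report, metric):
    """Mean over classes weighted by the number of examples per class."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall"):
    print(metric, round(macro_avg(report, metric), 2), round(weighted_avg(report, metric), 2))
```

The large neutral class (277 of 452 examples) pulls the weighted averages above the macro averages.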

model_cfg._method_ = []
cfg = eKonf.compose(config_group='pipeline')
cfg.name = 'fomc_sent_sentiments'
cfg.data_dir = data_dir
cfg.data_file = "fomc_sents.parquet"
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = data_dir
cfg.predict.output_file = f'{cfg.name}_finbert.parquet'
fomc_sent_sentiments_finbert = eKonf.instantiate(cfg)
fomc_sent_sentiments_finbert.head()

	text	split	timestamp	content_type	date	speaker	title	rate	...	recent_rate	next_meeting	next_rate	text_num_words	section_id	sent_id	pred_labels	raw_preds	pred_probs
0	The Secretary reported that advices of the ele...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	47	29	0	neutral	[2.2915966510772705, -0.9586986899375916, -2.4...	0.955035
1	By unanimous vote, the Committee elected the f...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	73	35	0	neutral	[1.692587971687317, -0.2049560397863388, -2.53...	0.858684
2	By unanimous vote, William J. McDonough, Marga...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	74	37	0	neutral	[1.8892569541931152, -0.3972317576408386, -2.6...	0.898655
3	On January 15, 1993, the continuing rules, reg...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	59	39	0	neutral	[2.335063934326172, -1.0927255153656006, -2.45...	0.960805
4	Members were asked to indicate if they wished ...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	25	39	1	neutral	[2.3842966556549072, -1.327033519744873, -2.36...	0.967973

5 rows × 22 columns

fomc_sent_sentiments_finbert = eKonf.load_data('fomc_sent_sentiments_finbert.parquet', data_dir)
fomc_sent_sentiments_finbert

	id	text	split	timestamp	content_type	date	speaker	title	decision	rate	...	recent_rate	next_meeting	next_decision	next_rate	text_num_words	section_id	sent_id	pred_labels	raw_preds	pred_probs
0	0	The Secretary reported that advices of the ele...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	3.00	1993-02-18	0.0	3.00	47	29	0	neutral	[2.2915966510772705, -0.9586986899375916, -2.4...	0.955035
1	0	By unanimous vote, the Committee elected the f...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	3.00	1993-02-18	0.0	3.00	73	35	0	neutral	[1.692587971687317, -0.2049560397863388, -2.53...	0.858684
2	0	By unanimous vote, William J. McDonough, Marga...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	3.00	1993-02-18	0.0	3.00	74	37	0	neutral	[1.8892569541931152, -0.3972317576408386, -2.6...	0.898655
3	0	On January 15, 1993, the continuing rules, reg...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	3.00	1993-02-18	0.0	3.00	59	39	0	neutral	[2.335063934326172, -1.0927255153656006, -2.45...	0.960805
4	0	Members were asked to indicate if they wished ...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	0.0	3.00	...	3.00	1993-02-18	0.0	3.00	25	39	1	neutral	[2.3842966556549072, -1.327033519744873, -2.36...	0.967973
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
653463	2854	It will not have the word “somewhat” on line 3.	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	0.25	2015-01-28	0.0	0.25	10	287	2	neutral	[2.3733553886413574, -1.3893874883651733, -2.2...	0.967620
653464	2854	Chair Yellen Yes Vice Chairman Dudley ...	train	2014-12-17	fomc_meeting_script	2014-12-17	MR. LUECKE	FOMC Meeting Transcript	0.0	0.25	...	0.25	2015-01-28	0.0	0.25	31	287	4	neutral	[2.224238872528076, -0.893900990486145, -2.421...	0.948905
653465	2854	And let me confirm that the next meeting will ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	0.25	2015-01-28	0.0	0.25	19	288	3	neutral	[2.349479913711548, -1.391977071762085, -2.298...	0.967770
653466	2854	I believe box lunches are now available for pe...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	0.25	2015-01-28	0.0	0.25	33	288	4	neutral	[2.311434030532837, -1.1415460109710693, -2.37...	0.960746
653467	2854	I will do my best, and I will consider at the ...	train	2014-12-17	fomc_meeting_script	2014-12-17	CHAIR YELLEN	FOMC Meeting Transcript	0.0	0.25	...	0.25	2015-01-28	0.0	0.25	18	288	5	neutral	[2.3932976722717285, -1.2607381343841553, -2.4...	0.967106

653468 rows × 22 columns

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments_finbert.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.aggregate_scores._method_ = 'classification'
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_mean', 'polarity_diffusion', 'num_examples']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_sentiment_finbert_next.parquet'
tone_data_finbert = eKonf.instantiate(cfg)
tone_data_finbert = eKonf.to_datetime(tone_data_finbert, _columns='date')
tone_data_finbert = tone_data_finbert.set_index('date')
eKonf.save_data(tone_data_finbert, 'fomc_tone_data_finbert.parquet', data_dir)

tone_data_finbert = eKonf.load_data('fomc_tone_data_finbert.parquet', data_dir)

cols = [
    'polarity_mean_minutes', 'polarity_mean_press_conf', 'polarity_mean_speech', 'polarity_mean_statement',
    'polarity_diffusion_minutes', 'polarity_diffusion_press_conf', 'polarity_diffusion_speech', 'polarity_diffusion_statement',
]

tone_data_finbert = tone_data_finbert[cols].copy()
tone_data_finbert.columns = tone_data_finbert.columns.str.replace('polarity', 'finbert')
tone_data_finbert

	finbert_mean_minutes	finbert_mean_press_conf	finbert_mean_speech	finbert_mean_statement	finbert_diffusion_minutes	finbert_diffusion_press_conf	finbert_diffusion_speech	finbert_diffusion_statement
date
1990-02-07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1990-03-27	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1990-05-15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1990-07-03	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1990-08-21	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...
2021-11-30	NaN	NaN	0.182338	NaN	NaN	NaN	0.239583	NaN
2021-12-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2021-12-02	NaN	NaN	0.262141	NaN	NaN	NaN	0.338889	NaN
2021-12-15	0.509806	0.280516	NaN	0.412947	0.675	0.377622	NaN	0.555556
2021-12-17	NaN	NaN	0.408242	NaN	NaN	NaN	0.547619	NaN

1876 rows × 8 columns
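
A classifier emits labels and probabilities rather than a polarity score, so the labels have to be mapped to numbers before they can be averaged. One plausible mapping is sketched below: the sign of the label times the predicted probability for the mean, and the plain sign for the diffusion index. This mapping is an assumption made for illustration; the source does not document how the `classification` aggregation method computes its scores.

```python
import pandas as pd

# Assumed label-to-sign mapping.
SIGN = {"positive": 1, "neutral": 0, "negative": -1}

# Toy classifier output (made-up values) in the style of pred_labels / pred_probs.
preds = pd.DataFrame({
    "date": ["2021-12-15"] * 4,
    "pred_labels": ["positive", "positive", "neutral", "negative"],
    "pred_probs": [0.9, 0.6, 0.95, 0.7],
})

# Signed, confidence-weighted polarity per sentence (assumption).
preds["sign"] = preds["pred_labels"].map(SIGN)
preds["polarity"] = preds["sign"] * preds["pred_probs"]

summary = preds.groupby("date").agg(
    polarity_mean=("polarity", "mean"),
    polarity_diffusion=("sign", "mean"),  # (n_positive - n_negative) / n
    num_examples=("sign", "size"),
)
print(summary)
```

Under this mapping the two measures differ whenever the model is less than fully confident, which is one way the mean and diffusion columns above could diverge.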

Predict sentiments with T5#

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds_cfg.verbose = False

overrides=[
    '+model/transformer=t5_classification_with_simple',
    '+model/transformer/pretrained=t5-base',
]
model_cfg = eKonf.compose('model/transformer=t5_classification_with_simple', overrides)
model_cfg.name = 'fomc_t5'
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 8
model_cfg.config.eval_batch_size = 8
model_cfg._method_ = ['train', 'eval']
# model_cfg._method_ = ['eval']
t5_model = eKonf.instantiate(model_cfg)

/opt/conda/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5.py:164: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(

/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:3557: FutureWarning: 
`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.

  warnings.warn(formatted_warning, FutureWarning)

{'eval_loss': 0.06258307859733529}

Accuracy:  0.9491150442477876
Precison:  0.9495910765330818
Recall:  0.9491150442477876
F1 Score:  0.9474806981643329
Model Report: 
___________________________________________________
              precision    recall  f1-score   support

    negative       0.96      0.75      0.84        61
     neutral       0.97      1.00      0.98       277
    positive       0.90      0.94      0.92       114

    accuracy                           0.95       452
   macro avg       0.94      0.90      0.91       452
weighted avg       0.95      0.95      0.95       452

model_cfg._method_ = []
cfg = eKonf.compose(config_group='pipeline')
cfg.name = 'fomc_sent_sentiments'
cfg.data_dir = data_dir
cfg.data_file = "fomc_sents.parquet"
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = data_dir
cfg.predict.output_file = f'{cfg.name}_t5.parquet'
fomc_sent_sentiments_t5 = eKonf.instantiate(cfg)
fomc_sent_sentiments_t5.head()

	text	split	timestamp	content_type	date	speaker	title	rate	...	recent_rate	next_meeting	next_rate	text_num_words	section_id	sent_id	prefix	pred_labels
0	The Secretary reported that advices of the ele...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	47	29	0	classification	neutral
1	By unanimous vote, the Committee elected the f...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	73	35	0	classification	neutral
2	By unanimous vote, William J. McDonough, Marga...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	74	37	0	classification	neutral
3	On January 15, 1993, the continuing rules, reg...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	59	39	0	classification	neutral
4	Members were asked to indicate if they wished ...	train	1993-02-03	fomc_minutes	1993-02-03	Alan Greenspan	FOMC Meeting Minutes	3.0	...	3.0	1993-02-18	3.0	25	39	1	classification	neutral

5 rows × 21 columns
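
The `prefix` column in this output reflects T5's text-to-text framing: classification is posed as generating the label string from a prefixed input, so the model returns a label with no probability attached. The sketch below builds a frame in the `prefix` / `input_text` / `target_text` layout that Simple Transformers uses for T5. That ekorpkit prepares its data this way is an assumption based on the config name `t5_classification_with_simple`, and `to_t5_frame` and `render_input` are hypothetical helpers.

```python
import pandas as pd


def to_t5_frame(texts, labels, prefix="classification"):
    """Arrange sentences and labels in a text-to-text training layout."""
    return pd.DataFrame({
        "prefix": prefix,            # task name, broadcast to every row
        "input_text": list(texts),   # sentence to classify
        "target_text": list(labels), # label the model should generate
    })


def render_input(row):
    """Show the prefixed string a text-to-text model would read (assumed format)."""
    return f"{row['prefix']}: {row['input_text']}"


frame = to_t5_frame(
    ["Operating profit rose to EUR 13.1 mn.", "The company cut 50 jobs."],
    ["positive", "negative"],
)
print(frame)
print(render_input(frame.iloc[0]))
```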

cfg = eKonf.compose('pipeline')
cfg._pipeline_ = ['aggregate_scores', 'replace', 'pivot', 'save_dataframe']
cfg.num_workers = 100
cfg.data.data_file = "fomc_sent_sentiments_t5.parquet"
cfg.data.data_dir = data_dir
cfg.aggregate_scores.groupby = ['content_type', 'date']
cfg.aggregate_scores._method_ = 'classification_t5'
cfg.replace.apply_to = 'content_type'
cfg.replace.rcParams.to_replace = {'fomc_': ''}
cfg.replace.rcParams.regex = True
cfg.pivot.index = 'date'
cfg.pivot.columns = 'content_type'
cfg.pivot.values = ['polarity_diffusion', 'num_examples']
cfg.save_dataframe.output_dir = data_dir
cfg.save_dataframe.output_file = 'fomc_sentiment_t5_next.parquet'
tone_data_t5 = eKonf.instantiate(cfg)
tone_data_t5 = eKonf.to_datetime(tone_data_t5, _columns='date')
tone_data_t5 = tone_data_t5.set_index('date')
eKonf.save_data(tone_data_t5, 'fomc_tone_data_t5.parquet', data_dir)

tone_data_t5 = eKonf.load_data('fomc_tone_data_t5.parquet', data_dir)

cols = [
    'polarity_diffusion_minutes', 'polarity_diffusion_press_conf', 'polarity_diffusion_speech', 'polarity_diffusion_statement',
]

tone_data_t5 = tone_data_t5[cols].copy()
tone_data_t5.columns = tone_data_t5.columns.str.replace('polarity', 't5')
tone_data_t5

	t5_diffusion_minutes	t5_diffusion_press_conf	t5_diffusion_speech	t5_diffusion_statement
date
1990-02-07	NaN	NaN	NaN	NaN
1990-03-27	NaN	NaN	NaN	NaN
1990-05-15	NaN	NaN	NaN	NaN
1990-07-03	NaN	NaN	NaN	NaN
1990-08-21	NaN	NaN	NaN	NaN
...	...	...	...	...
2021-11-30	NaN	NaN	0.239583	NaN
2021-12-01	NaN	NaN	NaN	NaN
2021-12-02	NaN	NaN	0.250000	NaN
2021-12-15	0.403571	0.216783	NaN	0.444444
2021-12-17	NaN	NaN	0.174603	NaN

1876 rows × 4 columns