Topic Modeling with Prior

In the second stage of the analysis, the research applies topic modeling with prior information to refine the topics relevant to central bank policy uncertainty in Cambodia’s highly dollarized economy. The prior takes the form of seed words assigned to specific topics, which steers the model toward the themes of interest instead of leaving topic discovery entirely unsupervised. This keeps the derived topics aligned with the research objectives and gives subsequent analysis and interpretation a consistent basis.

Prior Information#

The prior information is set to guide the topic modeling towards specific themes relevant to central bank policy uncertainty. The prior consists of two main groups:

Group 0: Focuses on general economic indicators, including terms like ‘price’, ‘inflation’, ‘growth’, and ‘economy’.
Group 1: Concentrates on central banking aspects, with terms such as ‘nbc’, ‘central_bank’, ‘national_bank’, and ‘national_bank_cambodia’.
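
One way to picture how the two groups act on the model is as a per-word weight vector over the topics. The sketch below is an illustration under stated assumptions, not the thematos implementation: `build_word_priors` is a hypothetical helper, and it assumes each seed word receives `max_prior_weight` on its assigned topic and `min_prior_weight` on every other topic, using the values 1.0 and 0.01 shown in the configuration on this page.

```python
from typing import Dict, List

# Seed words per topic index, as listed in the two prior groups above.
PRIOR_GROUPS: Dict[int, List[str]] = {
    0: ["price", "inflation", "growth", "economy"],
    1: ["nbc", "central_bank", "national_bank", "national_bank_cambodia"],
}


def build_word_priors(
    groups: Dict[int, List[str]],
    num_topics: int,
    max_weight: float = 1.0,   # assumed to mirror max_prior_weight
    min_weight: float = 0.01,  # assumed to mirror min_prior_weight
) -> Dict[str, List[float]]:
    """Hypothetical helper: map each seed word to one weight per topic."""
    priors: Dict[str, List[float]] = {}
    for topic_id, words in groups.items():
        if not 0 <= topic_id < num_topics:
            raise ValueError(f"topic {topic_id} is outside 0..{num_topics - 1}")
        for word in words:
            # High weight on the assigned topic, low weight everywhere else.
            priors[word.lower()] = [
                max_weight if k == topic_id else min_weight
                for k in range(num_topics)
            ]
    return priors


word_priors = build_word_priors(PRIOR_GROUPS, num_topics=10)
print(word_priors["inflation"])
print(word_priors["nbc"])
```

With 10 topics, each vector has ten entries, with the single large entry at index 0 for the Group 0 words and at index 1 for the Group 1 words.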

Configuration#

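The configuration of the topic model with prior can be inspected with the following command, which prints the composed configuration without running the model (`noop=1`):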

!nbcpu +model=nbcpu-topic_prior noop=1

## Command Line Interface for HyFI ##
{'about': {'authors': 'Young Joon Lee <entelecheia@hotmail.com>',
           'description': 'Quantifying Central Bank Policy Uncertainty in a '
                          'Highly Dollarized Economy: A Topic Modeling '
                          'Approach',
           'homepage': 'https://nbcpu.entelecheia.ai',
           'license': 'MIT',
           'name': 'Measuring Central Bank Policy Uncertainty'},
 'debug_mode': False,
 'dryrun': False,
 'hydra_log_dir': '/home/yjlee/.hyfi/logs/hydra',
 'ignore_warnings': True,
 'logging_level': 'WARNING',
 'model': {'_config_group_': '/model',
           '_config_name_': 'lda',
           '_target_': 'thematos.models.lda.LdaModel',
           'autosave': True,
           'batch': {'_config_group_': '/batch',
                     '_config_name_': '__init__',
                     'batch_name': 'model',
                     'batch_num': None,
                     'batch_num_auto': False,
                     'batch_root': 'workspace/topic',
                     'config_dirname': 'configs',
                     'config_json': 'config.json',
                     'config_yaml': 'config.yaml',
                     'device': 'cpu',
                     'num_devices': 1,
                     'num_workers': 1,
                     'output_extention': None,
                     'output_suffix': None,
                     'random_seed': False,
                     'resume_latest': False,
                     'resume_run': False,
                     'seed': -1,
                     'verbose': True},
           'batch_name': 'model',
           'coherence_metric_list': ['u_mass', 'c_uci', 'c_npmi', 'c_v'],
           'corpus': {'_config_group_': '/dataset',
                      '_config_name_': 'topic_corpus',
                      '_target_': 'thematos.datasets.corpus.Corpus',
                      'batch': {'_config_group_': '/batch',
                                '_config_name_': '__init__',
                                'batch_name': 'corpus',
                                'batch_num': None,
                                'batch_num_auto': False,
                                'batch_root': 'workspace/topic',
                                'config_dirname': 'configs',
                                'config_json': 'config.json',
                                'config_yaml': 'config.yaml',
                                'device': 'cpu',
                                'num_devices': 1,
                                'num_workers': 1,
                                'output_extention': None,
                                'output_suffix': None,
                                'random_seed': False,
                                'resume_latest': False,
                                'resume_run': False,
                                'seed': -1,
                                'verbose': True},
                      'batch_name': 'corpus',
                      'data_load': {'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataframe',
                                    'columns': None,
                                    'data_dir': None,
                                    'data_file': 'datasets/processed/topic_noprior_filtered/train.parquet',
                                    'filetype': None,
                                    'index_col': None,
                                    'verbose': False},
                      'id_col': 'id',
                      'module': None,
                      'ngramize': True,
                      'ngrams': {'_config_group_': '/ngrams',
                                 '_config_name_': 'tp_ngrams',
                                 '_target_': 'thematos.datasets.ngrams.NgramConfig',
                                 'delimiter': '_',
                                 'max_cand': 5000,
                                 'max_len': 3,
                                 'min_cf': 20,
                                 'min_df': 10,
                                 'min_score': 0.5,
                                 'normalized': True,
                                 'workers': 0},
                      'path': {'_config_name_': '__batch__',
                               'batch_name': 'corpus',
                               'task_name': 'topic',
                               'task_root': 'workspace'},
                      'pipelines': [],
                      'stopwords': {'_config_group_': '/stopwords',
                                    '_config_name_': '__init__',
                                    '_target_': 'lexikanon.stopwords.Stopwords',
                                    'lowercase': True,
                                    'name': 'stopwords',
                                    'nltk_stopwords_lang': None,
                                    'stopwords_fn': None,
                                    'stopwords_list': None,
                                    'stopwords_path': '/home/yjlee/.hyfi/logs/hydra/hyfi/2023-08-15/2023-08-15_18-29-12/tests/assets/stopwords/nbcpu-topic.txt',
                                    'verbose': True},
                      'task_name': 'topic',
                      'task_root': 'workspace',
                      'text_col': 'adjnouns',
                      'timestamp_col': 'time',
                      'verbose': True,
                      'version': '0.0.0'},
           'eval_coherence': True,
           'model_args': {'_config_group_': '/model/config',
                          '_config_name_': 'lda',
                          '_target_': 'thematos.models.config.LdaConfig',
                          'alpha': 0.1,
                          'eta': 0.01,
                          'k': 10,
                          'min_cf': 10,
                          'min_df': 10,
                          'rm_top': 0,
                          'tw': 1},
           'model_type': 'LDA',
           'module': None,
           'path': {'_config_name_': '__batch__',
                    'batch_name': 'model',
                    'task_name': 'topic',
                    'task_root': 'workspace'},
           'pipelines': [],
           'save_full': True,
           'set_wordprior': True,
           'task_name': 'topic',
           'task_root': 'workspace',
           'train_args': {'_config_group_': '/model/train',
                          '_config_name_': 'topic',
                          '_target_': 'thematos.models.config.TrainConfig',
                          'burn_in': 0,
                          'interval': 10,
                          'iterations': 100},
           'train_summary_args': {'_config_group_': '/model/summary',
                                  '_config_name_': 'topic_train',
                                  '_target_': 'thematos.models.config.TrainSummaryConfig',
                                  'flush': False,
                                  'initial_hp': True,
                                  'params': True,
                                  'topic_word_top_n': 10},
           'verbose': True,
           'version': '0.0.0',
           'wc_args': {'_config_group_': '/model/plot',
                       '_config_name_': 'wordcloud',
                       '_target_': 'thematos.models.config.WordcloudConfig',
                       'dpi': 300,
                       'figsize': None,
                       'fontpath': None,
                       'height_multiple': 2,
                       'make_collage': True,
                       'mask_dir': None,
                       'num_cols': 5,
                       'num_images_per_page': 20,
                       'num_rows': None,
                       'output_file_format': 'wordcloud_p{page_num:02d}.png',
                       'save': True,
                       'title_color': 'green',
                       'title_fontsize': 14,
                       'titles': None,
                       'top_n': 500,
                       'wc': {'_config_group_': '/plot',
                              '_config_name_': 'wordcloud',
                              '_target_': 'thematos.plots.wordcloud.WordCloud',
                              'background_color': 'black',
                              'collocation_threshold': 30,
                              'collocations': True,
                              'color_func': None,
                              'colormap': 'PuBu',
                              'contour_color': 'steelblue',
                              'contour_width': 0,
                              'font_path': None,
                              'font_step': 1,
                              'height': 200,
                              'include_numbers': False,
                              'mask': None,
                              'max_font_size': None,
                              'max_words': 200,
                              'min_font_size': 4,
                              'min_word_length': 0,
                              'mode': 'RGB',
                              'normalize_plurals': True,
                              'prefer_horizontal': 0.9,
                              'regexp': None,
                              'relative_scaling': 'auto',
                              'repeat': False,
                              'scale': 1,
                              'stopwords': None,
                              'width': 400},
                       'width_multiple': 4},
           'wordprior': {'_config_group_': '/words',
                         '_config_name_': 'wordprior',
                         '_target_': 'thematos.models.prior.WordPrior',
                         'data_file': '/home/yjlee/.hyfi/logs/hydra/hyfi/2023-08-15/2023-08-15_18-29-12/tests/assets/words/word_prior.yaml',
                         'lowercase': True,
                         'max_prior_weight': 1.0,
                         'min_prior_weight': 0.01,
                         'prior_data': None,
                         'verbose': True}},
 'noop': 1,
 'resolve': True,
 'verbose': False,
 'version': '0.14.0'}

Dryrun is enabled, not running the HyFI config

Running the Workflow#

The entire workflow can be executed using the following command:

!nbcpu +workflow=nbcpu tasks='[nbcpu-topic_prior]' mode=__info__

[2023-08-15 18:39:58,536][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f15a43f6b80>
[2023-08-15 18:39:58,536][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 18:39:58,736][hyfi.main.main][INFO] - The HyFI config is not instantiatable, running HyFI task with the config
[2023-08-15 18:39:59,574][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f15785c1220>
[2023-08-15 18:40:00,715][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 18:40:02,146][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 18:40:02,147][hyfi.task.batch][INFO] - Initalized batch: model(0) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/model
[2023-08-15 18:40:02,847][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 18:40:04,591][hyfi.batch.batch][INFO] - Setting seed to 1440953704
[2023-08-15 18:40:04,591][hyfi.batch.batch][INFO] - Init batch - Batch name: model, Batch num: 0
[2023-08-15 18:40:05,390][hyfi.task.batch][INFO] - Initalized batch: corpus(2) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/corpus
[2023-08-15 18:40:05,391][hyfi.batch.batch][INFO] - Init batch - Batch name: model, Batch num: 5
[2023-08-15 18:40:05,391][hyfi.task.batch][INFO] - Initalized batch: model(5) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model
[2023-08-15 18:40:06,180][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 18:40:06,181][hyfi.task.batch][INFO] - Initalized batch: runner(2) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/runner
[2023-08-15 18:40:06,181][hyfi.workflow.workflow][INFO] - Running task [nbcpu-topic_prior] with [run={} verbose=False uses='nbcpu-topic_prior']
  0%|                                                     | 0/1 [00:00<?, ?it/s][2023-08-15 18:40:06,185][thematos.models.prior][INFO] - Loaded 2 words from /home/yjlee/workspace/projects/nbcpu/tests/assets/words/word_prior.yaml
[2023-08-15 18:40:06,185][thematos.models.prior][INFO] - Loaded 2 priors
[2023-08-15 18:40:06,183][thematos.models.base][INFO] - Set word prior with <WordPrior 2 priors>.
[2023-08-15 18:40:06,185][thematos.models.base][INFO] - Set words ['price', 'growth', 'inflation', 'economy'] to topic #0 as prior.
[2023-08-15 18:40:06,185][thematos.datasets.corpus][INFO] - Loading corpus...
[2023-08-15 18:40:06,185][thematos.datasets.corpus][INFO] - Processing documents in the column 'adjnouns'...
[2023-08-15 18:40:34,392][thematos.datasets.corpus][INFO] - Total 27594 documents are loaded.
[2023-08-15 18:40:46,145][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/corpus/corpus_doc_ids.parquet
[2023-08-15 18:40:46,168][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/corpus/configs/corpus(2)_config.json
[2023-08-15 18:40:46,168][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/corpus/configs/corpus(2)_config.yaml
[2023-08-15 18:40:46,788][thematos.models.base][INFO] - Set words ['national_bank_cambodia', 'central_bank', 'nbc', 'national_bank'] to topic #1 as prior.
[2023-08-15 18:40:47,560][thematos.models.lda][INFO] - Number of docs: 27594
[2023-08-15 18:40:47,560][thematos.models.lda][INFO] - Vocab size: 18158
[2023-08-15 18:40:47,561][thematos.models.lda][INFO] - Number of words: 4810963
[2023-08-15 18:40:47,561][thematos.models.lda][INFO] - Removed top words: []
[2023-08-15 18:40:47,561][thematos.models.lda][INFO] - Training model by iterating over the corpus 100 times, 10 iterations at a time with 0 workers

  0%|                                                    | 0/10 [00:00<?, ?it/s][2023-08-15 18:40:51,040][thematos.models.lda][INFO] - Iteration: 0	Log-likelihood: -9.145340999720068

 10%|████▍                                       | 1/10 [00:03<00:31,  3.49s/it][2023-08-15 18:40:54,537][thematos.models.lda][INFO] - Iteration: 10	Log-likelihood: -8.888749004161726

 20%|████████▊                                   | 2/10 [00:06<00:27,  3.49s/it][2023-08-15 18:40:58,485][thematos.models.lda][INFO] - Iteration: 20	Log-likelihood: -8.803375071366277

 30%|█████████████▏                              | 3/10 [00:10<00:26,  3.72s/it][2023-08-15 18:41:02,298][thematos.models.lda][INFO] - Iteration: 30	Log-likelihood: -8.749849464728818

 40%|█████████████████▌                          | 4/10 [00:14<00:22,  3.74s/it][2023-08-15 18:41:06,123][thematos.models.lda][INFO] - Iteration: 40	Log-likelihood: -8.714705246341687

 50%|██████████████████████                      | 5/10 [00:18<00:18,  3.78s/it][2023-08-15 18:41:09,859][thematos.models.lda][INFO] - Iteration: 50	Log-likelihood: -8.690089545139084

 60%|██████████████████████████▍                 | 6/10 [00:22<00:15,  3.76s/it][2023-08-15 18:41:13,547][thematos.models.lda][INFO] - Iteration: 60	Log-likelihood: -8.668976701892165

 70%|██████████████████████████████▊             | 7/10 [00:25<00:11,  3.73s/it][2023-08-15 18:41:17,288][thematos.models.lda][INFO] - Iteration: 70	Log-likelihood: -8.65117961727829

 80%|███████████████████████████████████▏        | 8/10 [00:29<00:07,  3.74s/it][2023-08-15 18:41:20,956][thematos.models.lda][INFO] - Iteration: 80	Log-likelihood: -8.637118155842469

 90%|███████████████████████████████████████▌    | 9/10 [00:33<00:03,  3.71s/it][2023-08-15 18:41:24,630][thematos.models.lda][INFO] - Iteration: 90	Log-likelihood: -8.62476473618058

100%|███████████████████████████████████████████| 10/10 [00:37<00:00,  3.71s/it]
<Basic Info>
| LDAModel (current version: 0.12.5)
| 27594 docs, 4810963 words
| Total Vocabs: 126469, Used Vocabs: 18158
| Entropy of words: 7.96401
| Entropy of term-weighted words: 8.76557
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -8.62476
|
<Initial Parameters>
| tw: TermWeight.IDF
| min_cf: 10 (minimum collection frequency of words)
| min_df: 10 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 10 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 1440953704 (random seed)
| trained in version 0.12.5
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.10728438 0.06657154 0.07608083 0.06004212 0.06778088 0.11082149
|   0.06340487 0.05877907 0.05908309 0.05704119]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| #0 (742782) : price market growth inflation rate economy oil share dollar global stock low index investor trade high export bank debt trading
| #1 (619991) : bank digital customer loan payment banking nbc service financial business insurance company cambodia riel mobile transaction technology smes branch financial_institution
| #2 (403293) : energy india power climate world green climate_change emission technology coal solar global company way malaysia system data renewable_energy human research
| #3 (313718) : tourism tourist chinese airport city hotel visitor flight cambodia passenger province siem_reap angkor travel destination airline temple sihanoukville phnom_penh cambodian
| #4 (421016) : worker factory project export port garment road construction company cambodia investment logistics labour ministry union sector electricity japanese vehicle industry
| #5 (772618) : cambodia tax development asean cooperation economic trade policy government investment woman minister law agreement cambodian social meeting sector sustainable member
| #6 (401326) : election party political military myanmar democracy leader state human_right government opposition protest bangladesh former khmer_rouge right pakistan politics power india
| #7 (386550) : rice farmer water land project agriculture river food fish tonne property province area forest agricultural community crop village price fishery
| #8 (395547) : police court victim commune party drug cnrp prison cpp district law money case election judge provincial suspect crime phnom_penh complaint
| #9 (354122) : health covid-19 vaccine case virus child pandemic vaccination infection huawei disease medical omicron outbreak community patient hospital covid quarantine death
|

[2023-08-15 18:41:25,291][thematos.models.base][INFO] - ==== Coherence : u_mass ====
[2023-08-15 18:41:25,291][thematos.models.base][INFO] - Average: -1.6402675149726402
[2023-08-15 18:41:25,292][thematos.models.base][INFO] - Per Topic: [-1.425266578567007, -1.0729033256163611, -1.9737517568414404, -1.6950357738440491, -1.4372304993750558, -1.0894857078690348, -1.3641567969557398, -1.9651203444009737, -2.0280244511223575, -2.3516999151343816]
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - ==== Coherence : c_uci ====
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - Average: 0.7173903977549194
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - Per Topic: [0.631636197256756, 1.0372400239549309, 1.0279241224565632, 1.1587594141923576, 0.4124652574389333, 0.3280867476153075, 0.9620709742174812, 0.8812465779074348, 0.9351462283826499, -0.20067156587322113]
[2023-08-15 18:41:26,612][thematos.models.base][INFO] - ==== Coherence : c_npmi ====
[2023-08-15 18:41:26,613][thematos.models.base][INFO] - Average: 0.10520817835902437
[2023-08-15 18:41:26,613][thematos.models.base][INFO] - Per Topic: [0.09169010689655911, 0.1513512764679383, 0.11547400424216366, 0.1390206534657676, 0.06185781111793033, 0.05339414794065087, 0.11865170803328073, 0.10947131563054402, 0.12918262344786416, 0.08198813634754491]
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - ==== Coherence : c_v ====
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - Average: 0.7093467557430267
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - Per Topic: [0.7229848951101303, 0.8240539312362671, 0.6891917288303375, 0.7377653867006302, 0.5392432540655137, 0.5775677382946014, 0.7515179306268692, 0.7305361032485962, 0.787589567899704, 0.7330170214176178]
[2023-08-15 18:41:30,303][thematos.models.base][INFO] - Model saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/models/LDA_model(5)_k(10).mdl
[2023-08-15 18:41:30,361][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-ll_per_word.csv
[2023-08-15 18:41:30,363][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.001810
[2023-08-15 18:41:30,766][thematos.models.base][INFO] - Log-likelihood per word plot saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-ll_per_word.png
[2023-08-15 18:41:30,824][thematos.models.base][INFO] - ==== Document-Topic Distributions ====
[2023-08-15 18:41:30,825][thematos.models.base][INFO] -           id    topic0    topic1  ...    topic7    topic8    topic9
27589  48239  0.833621  0.000136  ...  0.000120  0.083921  0.000116
27590  48216  0.086417  0.000143  ...  0.133420  0.000127  0.000123
27591  48132  0.000117  0.000073  ...  0.000064  0.319019  0.000062
27592  48115  0.070176  0.000184  ...  0.000163  0.000164  0.000158
27593  48085  0.000108  0.169788  ...  0.000059  0.000060  0.000058

[5 rows x 11 columns]
[2023-08-15 18:41:30,842][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-doc_topic_dists.parquet
[2023-08-15 18:41:30,933][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.090841
[2023-08-15 18:41:30,933][thematos.models.base][INFO] - ==== Topic-Word Distributions ====
[2023-08-15 18:41:30,938][thematos.models.base][INFO] -        cambodia  government  ...  rcep_cambodia-china_free  asia_pacific_region
5  7.159818e-03    0.003812  ...              4.491441e-09         4.491441e-09
6  6.598678e-09    0.002988  ...              6.515646e-09         6.515646e-09
7  1.485342e-03    0.001250  ...              7.418155e-09         7.418155e-09
8  2.527701e-04    0.000561  ...              6.831890e-09         6.832542e-09
9  2.897601e-03    0.001661  ...              8.522984e-09         4.052390e-05

[5 rows x 18158 columns]
[2023-08-15 18:41:31,048][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-topic_term_dists.parquet
[2023-08-15 18:41:33,325][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:02.276425
[2023-08-15 18:41:33,356][hyfi.utils.iolibs][INFO] - Save the list to the file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-used_vocab.txt, no. of words: 18158
[2023-08-15 18:41:33,367][hyfi.utils.iolibs][INFO] - Save the list to the file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-topic_top_words.txt, no. of words: 410
[2023-08-15 18:41:33,368][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/LDA_model(5)_k(10)-topic_top_words_dists.csv
[2023-08-15 18:41:33,370][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.001600
[2023-08-15 18:41:40,187][thematos.models.base][INFO] - Making wordcloud collage with titles: ['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5', 'Topic 6', 'Topic 7', 'Topic 8', 'Topic 9']
[2023-08-15 18:41:40,187][hyfi.graphics.collage][INFO] - Making page 1/1 with 10 images
[2023-08-15 18:41:40,187][hyfi.graphics.collage][INFO] - Page titles: ['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5', 'Topic 6', 'Topic 7', 'Topic 8', 'Topic 9']
[2023-08-15 18:41:40,187][hyfi.graphics.collage][INFO] - Page output file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/wordcloud_collage/LDA_model(5)_k(10)_wordcloud_00.png
[2023-08-15 18:41:43,448][hyfi.graphics.utils][INFO] - Saved subplots to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/wordcloud_collage/LDA_model(5)_k(10)_wordcloud_00.png
[2023-08-15 18:41:43,449][thematos.models.base][WARNING] - pyLDAvis is not installed. Please install it to save LDAvis.
[2023-08-15 18:41:43,465][thematos.models.base][INFO] - Model summary saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/outputs/model-summary.jsonl
[2023-08-15 18:41:43,465][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/configs/model(5)_config.json
[2023-08-15 18:41:43,466][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/model/configs/model(5)_config.yaml
100%|█████████████████████████████████████████████| 1/1 [01:37<00:00, 97.31s/it]
[2023-08-15 18:41:43,491][thematos.runners.topic][INFO] - Saved summaries to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/runner(2)_summaries.json
[2023-08-15 18:41:43,491][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/runner/configs/runner(2)_config.json
[2023-08-15 18:41:43,492][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_prior/runner/configs/runner(2)_config.yaml

Model Results#

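The Latent Dirichlet Allocation (LDA) model was trained on a corpus of 27,594 documents containing 4,810,963 words. After the frequency filters (min_cf = 10, min_df = 10), it used 18,158 of the 126,469 vocabulary terms. The model was configured with 10 topics, IDF term weighting, and initial hyperparameters alpha = 0.1 and eta = 0.01. Training ran for 100 iterations with no burn-in steps and an optimization interval of 10, reaching a log-likelihood per word of -8.62476. The per-topic alpha values reported after training range from about 0.057 to 0.111.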
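
The figures in the training summary can be turned into a few derived quantities that are easier to compare across runs. The sketch below uses only numbers printed in the log above. The perplexity line assumes the common definition exp(-log-likelihood per word), and because the model uses IDF term weighting, that value refers to term-weighted tokens and is not directly comparable with an unweighted model.

```python
import math

# Values copied from the <Basic Info> and <Training Info> blocks of the log.
num_docs = 27_594
num_words = 4_810_963
total_vocab = 126_469
used_vocab = 18_158
ll_per_word = -8.62476
iterations = 100
interval = 10

vocab_share = used_vocab / total_vocab   # fraction of vocabulary kept
avg_doc_len = num_words / num_docs       # mean words per document
train_rounds = iterations // interval    # number of logged training rounds
perplexity = math.exp(-ll_per_word)      # assumed definition, see lead-in

print(f"vocabulary kept: {vocab_share:.1%}")
print(f"average document length: {avg_doc_len:.1f} words")
print(f"training rounds: {train_rounds}")
print(f"perplexity (term-weighted): {perplexity:.0f}")
```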

The resultant topics reflect a diverse spectrum of subjects, including economics, finance, politics, technology, and social matters. Through the incorporation of prior information, the model was steered towards themes pertinent to central bank policy and economic indicators. Specifically:

Topic #0 aligns with the economic indicators prior (Group 0): all four seed words (‘price’, ‘growth’, ‘inflation’, ‘economy’) rank among its top words, alongside market and global economy terms.
Topic #1 corresponds to the central banking prior (Group 1): ‘nbc’ appears among its top words, although the topic is dominated by commercial banking, digital payments, and financial services.
Topic #5 covers trade, investment, and regional cooperation, including ASEAN, and relates indirectly to economic policy and central banking.
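
A quick way to check how strongly the priors shaped the seeded topics is to look up each seed word in the top-word list of its assigned topic. The sketch below uses the top 20 words of Topic #0 and Topic #1 exactly as printed in the training log. The variable names are illustrative only.

```python
# Top 20 words for the two seeded topics, copied from the <Topics> block.
top_words = {
    0: ("price market growth inflation rate economy oil share dollar global "
        "stock low index investor trade high export bank debt trading").split(),
    1: ("bank digital customer loan payment banking nbc service financial "
        "business insurance company cambodia riel mobile transaction "
        "technology smes branch financial_institution").split(),
}

# Seed words per topic, as defined in the prior.
seeds = {
    0: ["price", "inflation", "growth", "economy"],
    1: ["nbc", "central_bank", "national_bank", "national_bank_cambodia"],
}

report = {}
for topic_id, words in seeds.items():
    ranked = top_words[topic_id]
    # 1-based rank of each seed word that appears in the top 20.
    found = {w: ranked.index(w) + 1 for w in words if w in ranked}
    missing = [w for w in words if w not in found]
    report[topic_id] = (found, missing)
    print(f"topic #{topic_id}: found={found} missing={missing}")
```

All four Group 0 seed words appear in the top six words of Topic #0, while only ‘nbc’ from Group 1 appears in the top 20 of Topic #1. The multi-word central bank terms may still carry weight further down the distribution.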

Together, these topics suggest that the word priors helped the model home in on the areas of interest, namely economic indicators and central banking. The coherence scores support their interpretability: the average c_v across the ten topics is 0.709, and the two seeded topics score 0.723 (Topic #0) and 0.824 (Topic #1), the latter being the highest of all ten topics.
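
The per-topic coherence values in the training log can be ranked directly to see where the seeded topics stand. The sketch below copies the c_v scores from the log and recomputes their average. It does not recompute coherence from the corpus.

```python
from statistics import mean

# Per-topic c_v coherence copied from the training log (index = topic number).
c_v = [
    0.7229848951101303, 0.8240539312362671, 0.6891917288303375,
    0.7377653867006302, 0.5392432540655137, 0.5775677382946014,
    0.7515179306268692, 0.7305361032485962, 0.787589567899704,
    0.7330170214176178,
]

average = mean(c_v)
# Sort topics from most to least coherent.
ranked = sorted(enumerate(c_v), key=lambda pair: pair[1], reverse=True)

print(f"average c_v: {average:.4f}")
for topic_id, score in ranked:
    print(f"topic #{topic_id}: {score:.3f}")
```

The recomputed average matches the logged value. Topic #1 ranks first, whereas Topic #4 and Topic #5 have the lowest c_v scores.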

Fig. 5 shows the wordcloud of the top 500 words in each topic from the LDA model with 10 topics and prior. The size of the word is proportional to the frequency of the word in the topic.

Fig. 5 Wordcloud of the top 500 words in each topic from the LDA model with 10 topics and prior. The size of the word is proportional to the frequency of the word in the topic.