Topic Modeling for Uncertainty Analysis

Classifying uncertainty with a topic model calls for a more specialized setup than conventional topic modeling. The approach used here makes three modifications, each intended to identify and analyze uncertainty-related themes more precisely.

Three Modifications

1. Integration of Prior Knowledge

The model incorporates prior knowledge in the form of word priors, which steer specific topics toward uncertainty-related vocabulary. Two priors are defined:

  • Prior 0: Terms that directly express uncertainty, such as “uncertain,” “uncertainty,” and “risk.”

  • Prior 1: Terms that denote enhancement or fortification, such as “improve,” “strengthen,” “ensure,” and “enhance.”

These priors anchor topics 0 and 1 to the listed terms, which makes the extracted topics more relevant and specific to uncertainty.
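As a rough illustration of how such priors could be expressed, the sketch below turns the two word lists into per-topic weight vectors. It assumes that each prior word receives the weight max_prior_weight (1.0) on its assigned topic and min_prior_weight (0.01) on every other topic, with k = 10 topics; these values come from the configuration printed further down this page. The helper name build_word_priors is hypothetical, and the actual mapping inside thematos may differ.

```python
from typing import Dict, List

# Prior word lists as described above (topic id -> seed words).
PRIORS: Dict[int, List[str]] = {
    0: ["uncertain", "uncertainty", "risk"],
    1: ["improve", "strengthen", "ensure", "enhance"],
}


def build_word_priors(
    priors: Dict[int, List[str]],
    num_topics: int = 10,
    max_weight: float = 1.0,
    min_weight: float = 0.01,
) -> Dict[str, List[float]]:
    """Map each seed word to a per-topic weight vector (hypothetical helper)."""
    vectors: Dict[str, List[float]] = {}
    for topic_id, words in priors.items():
        for word in words:
            # High weight on the assigned topic, low weight everywhere else.
            vectors[word] = [
                max_weight if k == topic_id else min_weight
                for k in range(num_topics)
            ]
    return vectors


word_priors = build_word_priors(PRIORS)
print(word_priors["risk"][:3])  # [1.0, 0.01, 0.01]
```

The training log on this page indicates that tomotopy is used under the hood; with that library, each vector could plausibly be passed to LDAModel.set_word_prior(word, vector) before training starts.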

2. Strategic Removal of Stop Words

In addition to standard stop words, the model excludes the top 100 words from each of the 10 topics identified in the previous model, 896 words in total. The list is inspected manually so that uncertainty-related words are not removed. Filtering out these frequent but uninformative words lets the model concentrate on terms that are contextually significant for understanding uncertainty.
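The construction of such a list can be sketched as follows. This is a simplified illustration, not the project's own code: the manual inspection step is approximated by a fixed set of protected words, the function name build_stopword_candidates is hypothetical, and the toy topics are made-up data.

```python
from typing import Dict, Iterable, List, Set


def build_stopword_candidates(
    topic_top_words: Dict[int, List[str]],
    protected: Iterable[str],
    top_n: int = 100,
) -> List[str]:
    """Collect the top-n words of every topic, minus protected words."""
    keep: Set[str] = {w.lower() for w in protected}
    candidates: Set[str] = set()
    for words in topic_top_words.values():
        candidates.update(w.lower() for w in words[:top_n])
    # Words shared by several topics appear only once in the result.
    return sorted(candidates - keep)


# Toy input: two topics with three top words each (made-up data).
toy_topics = {
    0: ["bank", "loan", "risk"],
    1: ["bank", "market", "price"],
}
stopwords = build_stopword_candidates(
    toy_topics, protected=["risk", "uncertainty"]
)
print(stopwords)  # ['bank', 'loan', 'market', 'price']
```

Because words shared between topics are counted once, the final list is normally shorter than the number of topics times top_n.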

3. Fine-Tuning of Model Parameters

The model parameters are adjusted for the specific goal of classifying uncertainty. The key changes are:

  • min_cf: Set to 500. This is the minimum collection frequency, so only words that occur at least 500 times across the whole corpus are kept.

  • min_df: Also set to 500. This is the minimum document frequency, so only words that appear in at least 500 documents are kept.

Together with the word priors and the extended stop-word list, these settings tailor the topic model to capturing and classifying uncertainty-related themes, in line with the overall research goals.
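The difference between the two thresholds is easiest to see on a toy corpus. The sketch below is a plain-Python illustration of the filtering idea, not the library's implementation; the function name filter_vocabulary and the toy documents are assumptions, and the thresholds are scaled down from 500 so the effect is visible.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_vocabulary(
    docs: Iterable[List[str]], min_cf: int, min_df: int
) -> Set[str]:
    """Keep words meeting both the collection- and document-frequency floors."""
    cf: Counter = Counter()  # total occurrences across the corpus
    df: Counter = Counter()  # number of documents containing the word
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))
    return {w for w in cf if cf[w] >= min_cf and df[w] >= min_df}


# Toy corpus with small thresholds (the model itself uses 500 for both).
toy_docs = [
    ["risk", "risk", "risk", "bank"],
    ["bank", "loan"],
    ["bank", "risk"],
]
vocab = filter_vocabulary(toy_docs, min_cf=3, min_df=2)
print(sorted(vocab))  # ['bank', 'risk']
```

Here "risk" occurs four times in two documents and "bank" three times in three documents, so both pass; "loan" fails both floors. A word concentrated in a single long document could pass min_cf and still be dropped by min_df.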

Configuration Details

The following command prints the resolved configuration for the uncertainty topic model. Because noop=1 is set, the configuration is displayed but the model is not run:

!nbcpu +model=nbcpu-topic_uncertainty noop=1
## Command Line Interface for HyFI ##
{'about': {'authors': 'Young Joon Lee <entelecheia@hotmail.com>',
           'description': 'Quantifying Central Bank Policy Uncertainty in a '
                          'Highly Dollarized Economy: A Topic Modeling '
                          'Approach',
           'homepage': 'https://nbcpu.entelecheia.ai',
           'license': 'MIT',
           'name': 'Measuring Central Bank Policy Uncertainty'},
 'debug_mode': False,
 'dryrun': False,
 'hydra_log_dir': '/home/yjlee/.hyfi/logs/hydra',
 'ignore_warnings': True,
 'logging_level': 'WARNING',
 'model': {'_config_group_': '/model',
           '_config_name_': 'lda',
           '_target_': 'thematos.models.lda.LdaModel',
           'autosave': True,
           'batch': {'_config_group_': '/batch',
                     '_config_name_': '__init__',
                     'batch_name': 'model',
                     'batch_num': None,
                     'batch_num_auto': False,
                     'batch_root': 'workspace/topic',
                     'config_dirname': 'configs',
                     'config_json': 'config.json',
                     'config_yaml': 'config.yaml',
                     'device': 'cpu',
                     'num_devices': 1,
                     'num_workers': 1,
                     'output_extention': None,
                     'output_suffix': None,
                     'random_seed': False,
                     'resume_latest': False,
                     'resume_run': False,
                     'seed': -1,
                     'verbose': True},
           'batch_name': 'model',
           'coherence_metric_list': ['u_mass', 'c_uci', 'c_npmi', 'c_v'],
           'corpus': {'_config_group_': '/dataset',
                      '_config_name_': 'topic_corpus',
                      '_target_': 'thematos.datasets.corpus.Corpus',
                      'batch': {'_config_group_': '/batch',
                                '_config_name_': '__init__',
                                'batch_name': 'corpus',
                                'batch_num': None,
                                'batch_num_auto': False,
                                'batch_root': 'workspace/topic',
                                'config_dirname': 'configs',
                                'config_json': 'config.json',
                                'config_yaml': 'config.yaml',
                                'device': 'cpu',
                                'num_devices': 1,
                                'num_workers': 1,
                                'output_extention': None,
                                'output_suffix': None,
                                'random_seed': False,
                                'resume_latest': False,
                                'resume_run': False,
                                'seed': -1,
                                'verbose': True},
                      'batch_name': 'corpus',
                      'data_load': {'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataframe',
                                    'columns': None,
                                    'data_dir': None,
                                    'data_file': 'datasets/processed/topic_noprior_filtered/train.parquet',
                                    'filetype': None,
                                    'index_col': None,
                                    'verbose': False},
                      'id_col': 'id',
                      'module': None,
                      'ngramize': True,
                      'ngrams': {'_config_group_': '/ngrams',
                                 '_config_name_': 'tp_ngrams',
                                 '_target_': 'thematos.datasets.ngrams.NgramConfig',
                                 'delimiter': '_',
                                 'max_cand': 5000,
                                 'max_len': 3,
                                 'min_cf': 20,
                                 'min_df': 10,
                                 'min_score': 0.5,
                                 'normalized': True,
                                 'workers': 0},
                      'path': {'_config_name_': '__batch__',
                               'batch_name': 'corpus',
                               'task_name': 'topic',
                               'task_root': 'workspace'},
                      'pipelines': [],
                      'stopwords': {'_config_group_': '/stopwords',
                                    '_config_name_': '__init__',
                                    '_target_': 'lexikanon.stopwords.Stopwords',
                                    'lowercase': True,
                                    'name': 'stopwords',
                                    'nltk_stopwords_lang': None,
                                    'stopwords_fn': None,
                                    'stopwords_list': None,
                                    'stopwords_path': '/home/yjlee/.hyfi/logs/hydra/hyfi/2023-08-15/2023-08-15_19-05-04/tests/assets/stopwords/nbcpu-uncertainty.txt',
                                    'verbose': True},
                      'task_name': 'topic',
                      'task_root': 'workspace',
                      'text_col': 'tokens',
                      'timestamp_col': 'time',
                      'verbose': True,
                      'version': '0.0.0'},
           'eval_coherence': True,
           'model_args': {'_config_group_': '/model/config',
                          '_config_name_': 'lda',
                          '_target_': 'thematos.models.config.LdaConfig',
                          'alpha': 0.1,
                          'eta': 0.01,
                          'k': 10,
                          'min_cf': 500,
                          'min_df': 500,
                          'rm_top': 0,
                          'tw': 1},
           'model_type': 'LDA',
           'module': None,
           'path': {'_config_name_': '__batch__',
                    'batch_name': 'model',
                    'task_name': 'topic',
                    'task_root': 'workspace'},
           'pipelines': [],
           'save_full': True,
           'set_wordprior': True,
           'task_name': 'topic',
           'task_root': 'workspace',
           'train_args': {'_config_group_': '/model/train',
                          '_config_name_': 'topic',
                          '_target_': 'thematos.models.config.TrainConfig',
                          'burn_in': 0,
                          'interval': 10,
                          'iterations': 100},
           'train_summary_args': {'_config_group_': '/model/summary',
                                  '_config_name_': 'topic_train',
                                  '_target_': 'thematos.models.config.TrainSummaryConfig',
                                  'flush': False,
                                  'initial_hp': True,
                                  'params': True,
                                  'topic_word_top_n': 10},
           'verbose': True,
           'version': '0.0.0',
           'wc_args': {'_config_group_': '/model/plot',
                       '_config_name_': 'wordcloud',
                       '_target_': 'thematos.models.config.WordcloudConfig',
                       'dpi': 300,
                       'figsize': None,
                       'fontpath': None,
                       'height_multiple': 2,
                       'make_collage': True,
                       'mask_dir': None,
                       'num_cols': 5,
                       'num_images_per_page': 20,
                       'num_rows': None,
                       'output_file_format': 'wordcloud_p{page_num:02d}.png',
                       'save': True,
                       'title_color': 'green',
                       'title_fontsize': 14,
                       'titles': None,
                       'top_n': 500,
                       'wc': {'_config_group_': '/plot',
                              '_config_name_': 'wordcloud',
                              '_target_': 'thematos.plots.wordcloud.WordCloud',
                              'background_color': 'black',
                              'collocation_threshold': 30,
                              'collocations': True,
                              'color_func': None,
                              'colormap': 'PuBu',
                              'contour_color': 'steelblue',
                              'contour_width': 0,
                              'font_path': None,
                              'font_step': 1,
                              'height': 200,
                              'include_numbers': False,
                              'mask': None,
                              'max_font_size': None,
                              'max_words': 200,
                              'min_font_size': 4,
                              'min_word_length': 0,
                              'mode': 'RGB',
                              'normalize_plurals': True,
                              'prefer_horizontal': 0.9,
                              'regexp': None,
                              'relative_scaling': 'auto',
                              'repeat': False,
                              'scale': 1,
                              'stopwords': None,
                              'width': 400},
                       'width_multiple': 4},
           'wordprior': {'_config_group_': '/words',
                         '_config_name_': 'wordprior',
                         '_target_': 'thematos.models.prior.WordPrior',
                         'data_file': '/home/yjlee/.hyfi/logs/hydra/hyfi/2023-08-15/2023-08-15_19-05-04/tests/assets/words/word_prior_uncertainty.yaml',
                         'lowercase': True,
                         'max_prior_weight': 1.0,
                         'min_prior_weight': 0.01,
                         'prior_data': None,
                         'verbose': True}},
 'noop': 1,
 'resolve': True,
 'verbose': False,
 'version': '0.15.0'}

Dryrun is enabled, not running the HyFI config


Running the Workflow

The entire workflow can be executed using the following command:

!nbcpu +workflow=nbcpu tasks='[nbcpu-topic_uncertainty]' mode=__info__
[2023-08-15 19:07:43,714][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f5a586f5670>
[2023-08-15 19:07:43,714][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 19:07:43,912][hyfi.main.main][INFO] - The HyFI config is not instantiatable, running HyFI task with the config
[2023-08-15 19:07:44,736][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f5a387bf250>
[2023-08-15 19:07:45,870][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 19:07:47,283][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 19:07:47,283][hyfi.task.batch][INFO] - Initalized batch: model(0) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/model
[2023-08-15 19:07:47,976][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 19:07:49,701][hyfi.batch.batch][INFO] - Setting seed to 590167864
[2023-08-15 19:07:49,701][hyfi.batch.batch][INFO] - Init batch - Batch name: model, Batch num: 0
[2023-08-15 19:07:50,313][hyfi.batch.batch][INFO] - Setting seed to 2225682814
[2023-08-15 19:07:50,313][hyfi.batch.batch][INFO] - Init batch - Batch name: corpus, Batch num: 0
[2023-08-15 19:07:50,485][hyfi.batch.batch][INFO] - Init batch - Batch name: corpus, Batch num: 1
[2023-08-15 19:07:50,485][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/corpus
[2023-08-15 19:07:50,485][hyfi.batch.batch][INFO] - Init batch - Batch name: model, Batch num: 4
[2023-08-15 19:07:50,486][hyfi.task.batch][INFO] - Initalized batch: model(4) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model
[2023-08-15 19:07:51,274][hyfi.task.batch][INFO] - Initalized batch: corpus(1) in /home/yjlee/workspace/projects/nbcpu/workspace/topic/corpus
[2023-08-15 19:07:51,274][hyfi.task.batch][INFO] - Initalized batch: runner(1) in /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/runner
[2023-08-15 19:07:51,275][hyfi.workflow.workflow][INFO] - Running task [nbcpu-topic_uncertainty] with [run={} verbose=False uses='nbcpu-topic_uncertainty']
  0%|                                                     | 0/1 [00:00<?, ?it/s][2023-08-15 19:07:51,278][thematos.models.prior][INFO] - Loaded 2 words from /home/yjlee/workspace/projects/nbcpu/tests/assets/words/word_prior_uncertainty.yaml
[2023-08-15 19:07:51,278][thematos.models.prior][INFO] - Loaded 2 priors
[2023-08-15 19:07:51,277][thematos.models.base][INFO] - Set word prior with <WordPrior 2 priors>.
[2023-08-15 19:07:51,278][thematos.models.base][INFO] - Set words ['uncertain', 'uncertainty', 'risk'] to topic #0 as prior.
[2023-08-15 19:07:51,279][thematos.datasets.corpus][INFO] - Loading corpus...
[2023-08-15 19:07:51,279][thematos.datasets.corpus][INFO] - Processing documents in the column 'tokens'...
[2023-08-15 19:07:53,478][thematos.datasets.corpus][INFO] - 0    [national, bank, cambodia, nbc, ask, bank, fin...
1    [national, bank, cambodia, prepare, draft, pro...
2    [national, bank, cambodia, prepare, draft, pro...
3    [national, bank, cambodia, nbc, monday, sign, ...
4    [average, price, good, service, supply, market...
Name: tokens, dtype: object
[2023-08-15 19:07:53,481][thematos.datasets.corpus][INFO] - The first document: ['national' 'bank' 'cambodia' 'nbc' 'ask' 'bank' 'financial' 'institution'
 'consider' 'provide' 'loan' 'consumer' 'base' 'behaviour' 'data'
 'increase' 'financial' 'inclusion' 'digital' 'payment' 'develop' 'fast'
 'country' 'recommendation' 'make' 'launch' 'cambodian' 'share' 'switch'
 'event' 'common' 'payment' 'withdrawal' 'recently' 'organise' 'center'
 'bank' 'study' 'cbs' 'phnom' 'penh' 'preside' 'deputy' 'governor' 'nbc'
 'chea' 'serey' 'event' 'also' 'attend' 'national' 'international' 'guest'
 'central' 'bank' 'private' 'bank' 'financial' 'institution' 'include'
 'sok' 'voeun' 'chairman' 'cambodia' 'microfinance' 'association' 'cma'
 'char' 'sopheap' 'payment' 'committee' 'chairman' 'association' 'bank'
 'cambodia' 'abc' 'ivana' 'tranchini' 'visa' 'country' 'manager'
 'cambodia' 'payment' 'system' 'important' 'financial' 'inclusion'
 'process' "n't" 'want' 'make' 'payment' 'easy' 'also' 'want' 'make'
 'access' 'credit' 'easy' 'comfortable' 'potential' 'look' 'possibility'
 'lending' 'provide' 'credit' 'base' 'payment' 'behaviour' 'client' 'talk'
 'big' 'amount' 'small' 'amount' 'serey' 'underline' 'could' 'understand'
 'behaviour' 'client' 'someone' 'cautious' 'spender' 'someone' 'habit'
 'save' 'lot' 'get' 'benefit' 'term' 'low' 'interest' 'rate' 'financial'
 'institution' 'hope' 'continue' 'push' 'forward' 'usage' 'data' 'help'
 'access' 'financial' 'service' 'serey' 'say' 'serey' 'tell' 'khmer'
 'time' 'sideline' 'event' 'bank' 'financial' 'institution' 'currently'
 'use' 'necessary' 'data' 'collateral' 'identity' 'income' 'client'
 'assess' 'possibility' 'lend' 'data' 'consumer' 'behaviour' 'already'
 'exist' 'use' 'assessment' 'decision-making' 'loan' 'release' 'serey'
 'far' 'point' 'require' 'bank' 'financial' 'institution' 'must' 'get'
 'consent' 'client' 'use' 'type' 'behaviour' 'data' 'protect' 'client'
 'privacy' 'client' 'want' 'get' 'loan' 'allow' 'bank' 'use' 'information'
 'credit' 'assessment' 'would' 'fine' 'trade-off' 'say' 'example' 'people'
 'spend' 'income' 'financial' 'institution' 'use' 'data' 'consider'
 'decide' 'lend' 'people' 'behaviour' 'might' 'believe' 'unable' 'make'
 'repayment' 'financial' 'institution' 'lender' 'however' 'people' 'habit'
 'spending' 'much' 'less' 'income' 'may' 'get' 'credit' 'easy' 'serey'
 'add' 'bank' 'financial' 'institution' 'officer' 'also' 'need' 'explain'
 'clearly' 'advantage' 'disadvantage' 'customer' 'addition' 'request'
 'consent' 'data' 'usage' 'financial' 'service' 'consumer' 'may' 'lack'
 'knowledge' 'understand' 'consequence' 'say' 'add' 'officer' 'ask' 'make'
 'decision' 'permission' 'serey' 'also' 'say' 'number' 'payment' 'make'
 'electronic' 'system' 'cambodia' 'reach' 'nearly' 'percent' 'country'
 'gross' 'domestic' 'product' 'gdp' 'rise' 'approximately'
 'billion-nearly' 'percent' 'gdp' 'last' 'year' 'electronic' 'payment'
 'volume' 'rise' 'percent' 'annually' 'transaction' 'amount' 'payment'
 'voeun' 'also' 'ceo' 'deposit-taking' 'microfinance' 'institution' 'lolc'
 'cambodia' 'plc' 'ask' 'nbc' 'work' 'participate' 'institution' 'set'
 'maximum' 'limit' 'day' 'amount' 'withdrawable' 'cash' 'transaction'
 'atm']
[2023-08-15 19:07:53,740][hyfi.utils.iolibs][INFO] - Loaded the file: /home/yjlee/workspace/projects/nbcpu/tests/assets/stopwords/nbcpu-uncertainty.txt, No. of words: 875
[2023-08-15 19:07:53,741][hyfi.utils.iolibs][INFO] - Remove duplicate words, No. of words: 871
[2023-08-15 19:07:53,741][lexikanon.stopwords.stopwords][INFO] - Loaded 871 stopwords from /home/yjlee/workspace/projects/nbcpu/tests/assets/stopwords/nbcpu-uncertainty.txt
[2023-08-15 19:07:53,741][lexikanon.stopwords.stopwords][INFO] - Loaded 871 stopwords
[2023-08-15 19:09:03,836][thematos.datasets.corpus][INFO] - The first document: ['national' 'bank' 'cambodia' 'nbc' 'ask' 'bank' 'financial' 'institution'
 'consider' 'provide' 'loan' 'consumer' 'base' 'behaviour' 'data'
 'increase' 'financial' 'inclusion' 'digital' 'payment' 'develop' 'fast'
 'country' 'recommendation' 'make' 'launch' 'cambodian' 'share' 'switch'
 'event' 'common' 'payment' 'withdrawal' 'recently' 'organise' 'center'
 'bank' 'study' 'cbs' 'phnom' 'penh' 'preside' 'deputy' 'governor' 'nbc'
 'chea' 'serey' 'event' 'also' 'attend' 'national' 'international' 'guest'
 'central' 'bank' 'private' 'bank' 'financial' 'institution' 'include'
 'sok' 'voeun' 'chairman' 'cambodia' 'microfinance' 'association' 'cma'
 'char' 'sopheap' 'payment' 'committee' 'chairman' 'association' 'bank'
 'cambodia' 'abc' 'ivana' 'tranchini' 'visa' 'country' 'manager'
 'cambodia' 'payment' 'system' 'important' 'financial' 'inclusion'
 'process' "n't" 'want' 'make' 'payment' 'easy' 'also' 'want' 'make'
 'access' 'credit' 'easy' 'comfortable' 'potential' 'look' 'possibility'
 'lending' 'provide' 'credit' 'base' 'payment' 'behaviour' 'client' 'talk'
 'big' 'amount' 'small' 'amount' 'serey' 'underline' 'could' 'understand'
 'behaviour' 'client' 'someone' 'cautious' 'spender' 'someone' 'habit'
 'save' 'lot' 'get' 'benefit' 'term' 'low' 'interest' 'rate' 'financial'
 'institution' 'hope' 'continue' 'push' 'forward' 'usage' 'data' 'help'
 'access' 'financial' 'service' 'serey' 'say' 'serey' 'tell' 'khmer'
 'time' 'sideline' 'event' 'bank' 'financial' 'institution' 'currently'
 'use' 'necessary' 'data' 'collateral' 'identity' 'income' 'client'
 'assess' 'possibility' 'lend' 'data' 'consumer' 'behaviour' 'already'
 'exist' 'use' 'assessment' 'decision-making' 'loan' 'release' 'serey'
 'far' 'point' 'require' 'bank' 'financial' 'institution' 'must' 'get'
 'consent' 'client' 'use' 'type' 'behaviour' 'data' 'protect' 'client'
 'privacy' 'client' 'want' 'get' 'loan' 'allow' 'bank' 'use' 'information'
 'credit' 'assessment' 'would' 'fine' 'trade-off' 'say' 'example' 'people'
 'spend' 'income' 'financial' 'institution' 'use' 'data' 'consider'
 'decide' 'lend' 'people' 'behaviour' 'might' 'believe' 'unable' 'make'
 'repayment' 'financial' 'institution' 'lender' 'however' 'people' 'habit'
 'spending' 'much' 'less' 'income' 'may' 'get' 'credit' 'easy' 'serey'
 'add' 'bank' 'financial' 'institution' 'officer' 'also' 'need' 'explain'
 'clearly' 'advantage' 'disadvantage' 'customer' 'addition' 'request'
 'consent' 'data' 'usage' 'financial' 'service' 'consumer' 'may' 'lack'
 'knowledge' 'understand' 'consequence' 'say' 'add' 'officer' 'ask' 'make'
 'decision' 'permission' 'serey' 'also' 'say' 'number' 'payment' 'make'
 'electronic' 'system' 'cambodia' 'reach' 'nearly' 'percent' 'country'
 'gross' 'domestic' 'product' 'gdp' 'rise' 'approximately'
 'billion-nearly' 'percent' 'gdp' 'last' 'year' 'electronic' 'payment'
 'volume' 'rise' 'percent' 'annually' 'transaction' 'amount' 'payment'
 'voeun' 'also' 'ceo' 'deposit-taking' 'microfinance' 'institution' 'lolc'
 'cambodia' 'plc' 'ask' 'nbc' 'work' 'participate' 'institution' 'set'
 'maximum' 'limit' 'day' 'amount' 'withdrawable' 'cash' 'transaction'
 'atm']
[2023-08-15 19:09:04,085][thematos.datasets.corpus][INFO] - Total 27594 documents are loaded.
[2023-08-15 19:09:04,087][thematos.datasets.corpus][INFO] - Extracting n-grams...
[2023-08-15 19:09:09,949][thematos.datasets.corpus][INFO] - Total 1971 n-grams are extracted.
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["prei","kuk"], name="", score=0.997459)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["ភុន","ច័ន្ទឧសភា"], name="", score=0.997435)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["cashville","kidz"], name="", score=0.996442)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["modus","operandi"], name="", score=0.996309)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["suu","kyi"], name="", score=0.996141)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["lese","majeste"], name="", score=0.994870)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["ហ៊ុន","សែន"], name="", score=0.994573)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["bharatiya","janata"], name="", score=0.993389)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["los","angeles"], name="", score=0.989665)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["kuala","lumpur"], name="", score=0.987660)
[2023-08-15 19:09:09,951][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["reduced","rapid"], name="", score=0.501001)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["entry","exit"], name="", score=0.500969)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["constitutional","amendment"], name="", score=0.500843)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["governor","keut"], name="", score=0.500629)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["severely","affect"], name="", score=0.500479)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["liberation","front"], name="", score=0.500423)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["emphasize","importance"], name="", score=0.500367)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["weapon","mass"], name="", score=0.500308)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["highlight","importance"], name="", score=0.500183)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - tomotopy.label.Candidate(words=["atm","machine"], name="", score=0.500026)
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - Concatenating n-grams...
[2023-08-15 19:09:09,952][thematos.datasets.corpus][INFO] - <tomotopy.Document with words="ask consider provide base behaviour increase inclusion fast recommendation launch switch common withdrawal recently organise center cbs preside deputy governor chea serey attend guest voeun chairman cma char sopheap chairman abc ivana tranchini manager inclusion process easy easy comfortable potential look possibility provide base behaviour talk amount amount serey underline understand behaviour someone cautious spender someone habit save lot benefit term hope push forward usage serey serey sideline currently necessary collateral identity income assess possibility lend behaviour exist assessment decision-making release serey require consent type behaviour protect privacy assessment fine trade-off example spend income consider decide lend behaviour might believe unable repayment lender habit less income easy serey add explain clearly advantage disadvantage addition request consent usage lack knowledge understand consequence add ask permission serey electronic nearly gross domestic approximately billion-nearly electronic annually amount voeun ceo deposit-taking lolc ask participate maximum limit amount withdrawable atm">
[2023-08-15 19:09:09,953][thematos.datasets.corpus][INFO] - <tomotopy.Document with words="san francisco march twenty-five ago wide web idea technical paper obscure scientist physic lab idea tim berners-lee cern lab switzerland outline easily file linked pave phenomenon touch billion present paper march mark birthday web idea bold almost happen tremendous amount hubris beginning marc weber creator curator silicon valley tim berners-lee propose blue unrequested weber cern colleague completely ignore proposal web rival begin idea connect launched arpanet forerunner wide web several idea connect berners-lee convince cern adopt demonstrate usefulness compile lab phone aspect design put forward berners-lee across various operate ability click link file host locate elsewhere web winner gate rival us-based compuserve minitel involve fee berners-lee free real underdog one predict succeed weber gopher minnesota beat web early gore white weber vice gore web topple gopher agency launch whitehouse gov website huge stamp approval web web release free gopher weber realize web competitor weber lose battle certainly different lot top-down wall garden web competitor berners-lee free publish wish internet-linked titan yahoo page amount host server explode disrupt birth guilty lack imagination web gartner michael mcguire personal web disrupt lot ability freely file web shake traditional music film push edge jim dempsey vice us-based center anybody listener anybody publisher never anything powerful underlying tenet web egalitarian principle dempsey remain web hobble regulation fragment wall portion freedom threaten never stop teenage kid watch little snippet cute dempsey trouble limit ability criticize tiered hard innovator critic activist audience threats web base equality concern creator weber web unify decade ago nothing write stone fragment anew historian reason provider win preferential willingness invade privacy restrain web freedom battle shape web effect billion smartphones 
web really half worldwide yet weber">
[2023-08-15 19:09:10,248][thematos.datasets.corpus][INFO] - <tomotopy.Document with words="ask consider provide base behaviour increase inclusion fast recommendation launch switch common withdrawal recently organise center cbs preside deputy_governor chea_serey attend guest voeun chairman cma char sopheap chairman abc ivana tranchini manager inclusion process easy easy comfortable potential look possibility provide base behaviour talk amount amount serey underline understand behaviour someone cautious spender someone habit save lot benefit term hope push forward usage serey serey sideline currently necessary collateral identity income assess possibility lend behaviour exist assessment decision-making release serey require consent type behaviour protect privacy assessment fine trade-off example spend income consider decide lend behaviour might believe unable repayment lender habit less income easy serey add explain clearly advantage disadvantage addition request consent usage lack knowledge understand consequence add ask permission serey electronic nearly gross_domestic approximately billion-nearly electronic annually amount voeun ceo deposit-taking lolc ask participate maximum limit amount withdrawable atm">
[2023-08-15 19:09:10,248][thematos.datasets.corpus][INFO] - <tomotopy.Document with words="san_francisco march twenty-five ago wide web idea technical paper obscure scientist physic lab idea tim berners-lee cern lab switzerland outline easily file linked pave phenomenon touch billion present paper march mark birthday web idea bold almost happen tremendous amount hubris beginning marc weber creator curator silicon_valley tim berners-lee propose blue unrequested weber cern colleague completely ignore proposal web rival begin idea connect launched arpanet forerunner wide web several idea connect berners-lee convince cern adopt demonstrate usefulness compile lab phone aspect design put forward berners-lee across various operate ability click link file host locate elsewhere web winner gate rival us-based compuserve minitel involve fee berners-lee free real underdog one predict succeed weber gopher minnesota beat web early gore white weber vice gore web topple gopher agency launch whitehouse gov website huge stamp approval web web release free gopher weber realize web competitor weber lose battle certainly different lot top-down wall garden web competitor berners-lee free publish wish internet-linked titan yahoo page amount host server explode disrupt birth guilty lack imagination web gartner michael mcguire personal web disrupt lot ability freely file web shake traditional music film push edge jim dempsey vice us-based center anybody listener anybody publisher never anything powerful underlying tenet web egalitarian principle dempsey remain web hobble regulation fragment wall portion freedom threaten never stop teenage kid watch little snippet cute dempsey trouble limit ability criticize tiered hard innovator critic activist audience threats web base equality concern creator weber web unify decade ago nothing write stone fragment anew historian reason provider win preferential willingness invade privacy restrain web freedom battle shape web effect billion smartphones 
web really half worldwide yet weber">
[2023-08-15 19:09:10,248][thematos.datasets.corpus][INFO] - Total 27594 documents are n-gramized.
[2023-08-15 19:09:10,250][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/corpus/corpus_doc_ids.parquet
[2023-08-15 19:09:10,273][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.022347
[2023-08-15 19:09:10,274][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/corpus/configs/corpus(1)_config.json
[2023-08-15 19:09:10,274][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/corpus/configs/corpus(1)_config.yaml
[2023-08-15 19:09:10,871][thematos.models.base][INFO] - Set words ['enhance', 'ensure', 'improve', 'strengthen'] to topic #1 as prior.
[2023-08-15 19:09:11,425][thematos.models.lda][INFO] - Number of docs: 27594
[2023-08-15 19:09:11,425][thematos.models.lda][INFO] - Vocab size: 1102
[2023-08-15 19:09:11,425][thematos.models.lda][INFO] - Number of words: 1916988
[2023-08-15 19:09:11,425][thematos.models.lda][INFO] - Removed top words: []
[2023-08-15 19:09:11,425][thematos.models.lda][INFO] - Training model by iterating over the corpus 100 times, 10 iterations at a time with 0 workers

  0%|                                                    | 0/10 [00:00<?, ?it/s][2023-08-15 19:09:15,120][thematos.models.lda][INFO] - Iteration: 0	Log-likelihood: -7.758692721106333

 10%|████▍                                       | 1/10 [00:03<00:33,  3.70s/it][2023-08-15 19:09:18,869][thematos.models.lda][INFO] - Iteration: 10	Log-likelihood: -7.343293556088631

 20%|████████▊                                   | 2/10 [00:07<00:29,  3.73s/it][2023-08-15 19:09:22,708][thematos.models.lda][INFO] - Iteration: 20	Log-likelihood: -7.2300475972122795

 30%|█████████████▏                              | 3/10 [00:11<00:26,  3.78s/it][2023-08-15 19:09:26,708][thematos.models.lda][INFO] - Iteration: 30	Log-likelihood: -7.181403094694064

 40%|█████████████████▌                          | 4/10 [00:15<00:23,  3.87s/it][2023-08-15 19:09:30,682][thematos.models.lda][INFO] - Iteration: 40	Log-likelihood: -7.15484932973934

 50%|██████████████████████                      | 5/10 [00:19<00:19,  3.91s/it][2023-08-15 19:09:34,609][thematos.models.lda][INFO] - Iteration: 50	Log-likelihood: -7.137309273065852

 60%|██████████████████████████▍                 | 6/10 [00:23<00:15,  3.91s/it][2023-08-15 19:09:38,468][thematos.models.lda][INFO] - Iteration: 60	Log-likelihood: -7.122647192782161

 70%|██████████████████████████████▊             | 7/10 [00:27<00:11,  3.90s/it][2023-08-15 19:09:42,328][thematos.models.lda][INFO] - Iteration: 70	Log-likelihood: -7.112307842481844

 80%|███████████████████████████████████▏        | 8/10 [00:30<00:07,  3.88s/it][2023-08-15 19:09:46,239][thematos.models.lda][INFO] - Iteration: 80	Log-likelihood: -7.103769303813615

 90%|███████████████████████████████████████▌    | 9/10 [00:34<00:03,  3.89s/it][2023-08-15 19:09:50,140][thematos.models.lda][INFO] - Iteration: 90	Log-likelihood: -7.095995534976736

100%|███████████████████████████████████████████| 10/10 [00:38<00:00,  3.87s/it]
<Basic Info>
| LDAModel (current version: 0.12.5)
| 27594 docs, 1916988 words
| Total Vocabs: 131732, Used Vocabs: 1102
| Entropy of words: 6.75978
| Entropy of term-weighted words: 6.90695
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -7.09600
|
<Initial Parameters>
| tw: TermWeight.IDF
| min_cf: 500 (minimum collection frequency of words)
| min_df: 500 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 20 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 590167864 (random seed)
| trained in version 0.12.5
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.02006483 0.04694471 0.01432557 0.00799661 0.01058124 0.01064451
|   0.00866108 0.00949337 0.00843484 0.00827856 0.0124673  0.01257223
|   0.01054008 0.01247833 0.01060978 0.01064283 0.01324215 0.01039483
|   0.01151689 0.01167775]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| #0 (174430) : risk slow recovery cut remain fall fiscal weak hike likely slowdown grow strong domestic recession decline uncertainty increase survey fell
| #1 (370043) : improve provide strengthen ensure enhance achieve implement increase contribute aim framework reduce goal address technical benefit forum strategic stakeholder financing
| #2 (91486) : increase zone add special siem_reap provide attract korean invest produce expand boost ltd director complete facility collect income minimum operation
| #3 (48287) : dispute arrest gold claim resolve ask accuse trust appeal title sell file send name seize involve fine mother problem try
| #4 (95614) : attack reform citizen fact respect fight problem think win never organisation protect bear view act always clear sam word understand
| #5 (76207) : list korean increase domestic negotiation free minimum boost incentive renewable raise stake attract potential reduce invest sign asset value footwear
| #6 (59974) : accident hall ask concern protest spokesman add representative commission request ban process block claim accept hold remove administration assembly affect
| #7 (68521) : kill arrest live leave attack die night try ask back never stop sell later mother lose away incident think protest
| #8 (63445) : protest activist win freedom reform movement rally attack ban cabinet sanction poll claim assembly accuse sunday civil act hold post
| #9 (53266) : fell york gain trader cut wall rally equity earnings drop loss fall profit hike yield hit corp surge record weak
| #10 (102431) : live problem affect poor cause often think risk damage less bad enough long happen back die protect loss severe leave
| #11 (88649) : siem_reap park kampot complete sell facility centre route connect phase locate produce ship resort hour add center line koh link
| #12 (70812) : daily record reopen cent announce vaccinate today recovery drop ban remain household situation germany normal increase surge cause poor direct
| #13 (62981) : increase abc respectively drop gain value compare pwsa gti period decline year-on-year decrease fell compare_period record fall footwear profit finish
| #14 (63674) : arrest kandal kill live accuse send night son die ask koh morning body accident incident sell leave collapse vaccinate identify
| #15 (68178) : article draft document accuse arrest request assembly fine appeal code submit ask procedure file register permit convict add send require
| #16 (114897) : youth experience provide best launch benefit feature team grow ceo different learn together idea design foundation example various traditional understand
| #17 (70406) : asset transfer ceo shareholder corporate regulator list trust executive provide portfolio launch operation grow best provider experience staff value agent
| #18 (92694) : real_estate buy agent purchase transfer grow store provide ceo look affordable launch easy option experience re sell available invest convenient
| #19 (80993) : request assembly register add draft affair spokesman civil sam reform process deputy discuss administration relevant submit ask hold procedure governor
|

[2023-08-15 19:09:50,965][thematos.models.base][INFO] - ==== Coherence : u_mass ====
[2023-08-15 19:09:50,965][thematos.models.base][INFO] - Average: -2.0675997963694988
[2023-08-15 19:09:50,965][thematos.models.base][INFO] - Per Topic: [-1.9086247983459215, -1.3976213790853773, -2.1882474353474812, -2.5721022891995746, -1.9975944984014402, -2.7226100753271103, -2.0055773674467345, -1.6916198990274103, -2.4384248735913547, -1.8001375528751784, -1.826303930648957, -2.4255806540436002, -2.220352322267821, -2.1150062333273456, -2.0433877722588516, -2.2476214746839625, -1.8547688291891224, -1.8227270593580585, -1.799965619790992, -2.2737218631736935]
[2023-08-15 19:09:51,439][thematos.models.base][INFO] - ==== Coherence : c_uci ====
[2023-08-15 19:09:51,440][thematos.models.base][INFO] - Average: 0.5407962501771852
[2023-08-15 19:09:51,440][thematos.models.base][INFO] - Per Topic: [0.7980955958033914, 0.6568717860517432, 0.37871382559844646, 0.24468810769433685, 0.24671037142181412, -0.5484771488713698, 0.2858446921351938, 0.5664955644072326, 0.4830281137375432, 1.3018222798657366, 0.39453938957949164, 0.6325481265318617, 0.37858146150510924, 1.7849392823267523, 0.7705559474225172, 0.9236258523830794, 0.13867486010456936, 0.6123869983666509, 0.2811976643458761, 0.4850822331337288]
[2023-08-15 19:09:51,893][thematos.models.base][INFO] - ==== Coherence : c_npmi ====
[2023-08-15 19:09:51,894][thematos.models.base][INFO] - Average: 0.066997714678115
[2023-08-15 19:09:51,894][thematos.models.base][INFO] - Per Topic: [0.08972189379830393, 0.08460459654168709, 0.04861142683257556, 0.03847374976484533, 0.027179217297177302, 0.00026794000017148797, 0.03754634335139229, 0.06217883791201524, 0.05328677116199519, 0.13792800009146242, 0.04825418436434792, 0.06778716593319205, 0.05224243061223199, 0.22305745521267398, 0.08509781648733664, 0.11051469073545851, 0.018417100260943384, 0.06588315923933655, 0.035715388916862274, 0.05318612504829075]
[2023-08-15 19:09:54,502][thematos.models.base][INFO] - ==== Coherence : c_v ====
[2023-08-15 19:09:54,503][thematos.models.base][INFO] - Average: 0.6176439623348415
[2023-08-15 19:09:54,503][thematos.models.base][INFO] - Per Topic: [0.7311489164829255, 0.7208846032619476, 0.5343140661716461, 0.5964014139026403, 0.5790759354829789, 0.48090492784976957, 0.5460225224494935, 0.7206535756587982, 0.6764586597681046, 0.7651297628879548, 0.5198783069849015, 0.5806703731417656, 0.4698037400841713, 0.5925369501113892, 0.7583050549030304, 0.6973470628261567, 0.5783749014139176, 0.6185191124677658, 0.5549376428127288, 0.6315117180347443]
[2023-08-15 19:09:54,554][thematos.models.base][INFO] - Model saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/models/LDA_model(4)_k(20).mdl
[2023-08-15 19:09:54,605][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-ll_per_word.csv
[2023-08-15 19:09:54,607][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.001788
[2023-08-15 19:09:55,011][thematos.models.base][INFO] - Log-likelihood per word plot saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-ll_per_word.png
[2023-08-15 19:09:55,079][thematos.models.base][INFO] - ==== Document-Topic Distributions ====
[2023-08-15 19:09:55,079][thematos.models.base][INFO] -           id    topic0    topic1  ...   topic17   topic18   topic19
27589  48239  0.000176  0.000412  ...  0.000091  0.000101  0.000102
27590  48216  0.000160  0.000375  ...  0.000083  0.000092  0.000093
27591  48132  0.000061  0.019791  ...  0.000032  0.000035  0.178651
27592  48115  0.258588  0.740238  ...  0.000063  0.000069  0.000070
27593  48085  0.000056  0.000132  ...  0.000029  0.000032  0.046560

[5 rows x 21 columns]
[2023-08-15 19:09:55,108][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-doc_topic_dists.parquet
[2023-08-15 19:09:55,272][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.164053
[2023-08-15 19:09:55,272][thematos.models.base][INFO] - ==== Topic-Word Distributions ====
[2023-08-15 19:09:55,273][thematos.models.base][INFO] -          add  increase   provide  ...        actual     elsewhere     seriously
15  0.005102  0.000574  0.002074  ...  6.625699e-04  5.167474e-08  9.678460e-04
16  0.002256  0.002083  0.004857  ...  3.056953e-08  2.804690e-04  3.048550e-04
17  0.002904  0.003924  0.006911  ...  4.072869e-04  1.013075e-04  5.078265e-08
18  0.003353  0.003373  0.005500  ...  1.029137e-03  6.048432e-04  3.891736e-08
19  0.005899  0.001628  0.003265  ...  8.575640e-04  4.465753e-08  7.463032e-04

[5 rows x 1102 columns]
[2023-08-15 19:09:55,373][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-topic_term_dists.parquet
[2023-08-15 19:09:55,499][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.125595
[2023-08-15 19:09:55,502][hyfi.utils.iolibs][INFO] - Save the list to the file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-used_vocab.txt, no. of words: 1102
[2023-08-15 19:09:55,505][hyfi.utils.iolibs][INFO] - Save the list to the file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-topic_top_words.txt, no. of words: 497
[2023-08-15 19:09:55,506][hyfi.utils.datasets.save][INFO] - Saving dataframe to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/LDA_model(4)_k(20)-topic_top_words_dists.csv
[2023-08-15 19:09:55,509][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.002590
[2023-08-15 19:10:10,546][thematos.models.base][INFO] - Making wordcloud collage with titles: ['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5', 'Topic 6', 'Topic 7', 'Topic 8', 'Topic 9', 'Topic 10', 'Topic 11', 'Topic 12', 'Topic 13', 'Topic 14', 'Topic 15', 'Topic 16', 'Topic 17', 'Topic 18', 'Topic 19']
[2023-08-15 19:10:10,546][hyfi.graphics.collage][INFO] - Making page 1/1 with 20 images
[2023-08-15 19:10:10,546][hyfi.graphics.collage][INFO] - Page titles: ['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5', 'Topic 6', 'Topic 7', 'Topic 8', 'Topic 9', 'Topic 10', 'Topic 11', 'Topic 12', 'Topic 13', 'Topic 14', 'Topic 15', 'Topic 16', 'Topic 17', 'Topic 18', 'Topic 19']
[2023-08-15 19:10:10,546][hyfi.graphics.collage][INFO] - Page output file: /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/wordcloud_collage/LDA_model(4)_k(20)_wordcloud_00.png
[2023-08-15 19:10:17,977][hyfi.graphics.utils][INFO] - Saved subplots to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/wordcloud_collage/LDA_model(4)_k(20)_wordcloud_00.png
[2023-08-15 19:10:17,978][thematos.models.base][WARNING] - pyLDAvis is not installed. Please install it to save LDAvis.
[2023-08-15 19:10:17,994][thematos.models.base][INFO] - Model summary saved to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/outputs/model-summary.jsonl
[2023-08-15 19:10:17,994][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/configs/model(4)_config.json
[2023-08-15 19:10:17,995][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/model/configs/model(4)_config.yaml
100%|████████████████████████████████████████████| 1/1 [02:26<00:00, 146.74s/it]
[2023-08-15 19:10:18,019][thematos.runners.topic][INFO] - Saved summaries to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/runner(1)_summaries.json
[2023-08-15 19:10:18,019][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/runner/configs/runner(1)_config.json
[2023-08-15 19:10:18,019][hyfi.composer.config][INFO] - Saving config to /home/yjlee/workspace/projects/nbcpu/workspace/nbcpu-topic_uncertainty/runner/configs/runner(1)_config.yaml

Model Results#

The specialized Latent Dirichlet Allocation (LDA) model, designed to classify uncertainty, was applied to a corpus of 27,594 documents containing 1,916,988 words. Of the 131,732 words in the full vocabulary, 1,102 were used in training, with 20 topics and parameters set to focus on uncertainty-related topics.

Key Findings#

  1. Topic #0: The top terms of this topic relate directly to uncertainty, such as “risk,” “slow,” “recovery,” “recession,” and “uncertainty,” alongside terms for economic movements like “cut,” “hike,” “fall,” “increase,” and “decline.” The topic captures economic uncertainty and the risks to recovery and growth.

  2. Topic #1: This topic aligns with the prior emphasizing improvement and strengthening. Terms like “improve,” “strengthen,” “ensure,” “achieve,” and “enhance” reflect efforts to mitigate uncertainty by strengthening frameworks, pursuing goals, and implementing measures.

  3. Topic Coherence Scores: The average coherence scores (u_mass at -2.0798, c_uci at 0.4653, c_npmi at 0.0657, and c_v at 0.5938) suggest that the topics are reasonably interpretable, although there is room for further refinement.
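
As a small illustration of how an average coherence score relates to the per-topic scores, the sketch below recomputes the c_v average from the per-topic values reported in the training log above for LDA_model(4) and picks out the least and most coherent topics. The variable names are illustrative and not part of thematos, and averages quoted for a different run will not match these values.

```python
from statistics import fmean

# Per-topic c_v scores copied from the LDA_model(4) training log above
# (list index = topic number).
c_v_per_topic = [
    0.7311489164829255, 0.7208846032619476, 0.5343140661716461,
    0.5964014139026403, 0.5790759354829789, 0.48090492784976957,
    0.5460225224494935, 0.7206535756587982, 0.6764586597681046,
    0.7651297628879548, 0.5198783069849015, 0.5806703731417656,
    0.4698037400841713, 0.5925369501113892, 0.7583050549030304,
    0.6973470628261567, 0.5783749014139176, 0.6185191124677658,
    0.5549376428127288, 0.6315117180347443,
]

# The reported "Average" is the arithmetic mean over topics.
average = fmean(c_v_per_topic)

# Rank topic indices from least to most coherent.
ranked = sorted(range(len(c_v_per_topic)), key=c_v_per_topic.__getitem__)
weakest, strongest = ranked[0], ranked[-1]

print(f"average c_v: {average:.4f}")
print(f"least coherent topic: #{weakest} ({c_v_per_topic[weakest]:.4f})")
print(f"most coherent topic:  #{strongest} ({c_v_per_topic[strongest]:.4f})")
```

Topics that sit well below the average, such as #12 and #5 in this run, are natural candidates for inspection in a further refinement round.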

Interpretation#

The results indicate that the model identifies and classifies uncertainty-related topics effectively. Incorporating prior knowledge, removing specific stop words, and fine-tuning the model parameters produced themes that correspond to the concept of uncertainty.

Topic #0 provides a comprehensive view of economic uncertainty, capturing the volatility and risks in the market. Topic #1, on the other hand, offers insights into strategies and efforts to navigate and mitigate uncertainty.

The tailored approach yields topics covering risk, recovery, and strategies to overcome uncertainty, and these topics align with the research objectives. The coherence scores suggest that the topics are interpretable, although further refinement may improve the model’s precision and depth of analysis.

Fig. 6 shows the wordcloud of the top 500 words in each topic from the LDA model with 20 topics and uncertainty prior.

../../_images/LDA_model%283%29_k%2820%29_wordcloud_00.png

Fig. 6 Wordcloud of the top 500 words in each topic from the LDA model with 20 topics and uncertainty prior.#

Refinement of Uncertainty Topic Modeling#

Uncertainty topic modeling is refined in two stages. Documents are first scored and filtered by their relevance to the uncertainty topics, and the topic model is then reapplied to the filtered dataset. This iteration sharpens the focus on themes pertinent to uncertainty.

In the first stage, each document is scored for its relevance to the uncertainty topics, specifically topics 0 and 1. The score, topic_relevant, is the combined weight of these two topics in the document (topic0 + topic1). The scores are then merged with the original tokenized dataset by document id.

The second stage filters the merged data with the query "topic_relevant > 0.5", so that only documents whose combined weight exceeds this threshold are retained for another round of topic modeling. In the logged run below, this retains 6,917 documents and discards 20,693.
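
The scoring, merging, and filtering steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the HyFI pipeline itself: the rows, ids, texts, and the helper score_and_split are hypothetical, and only the expression topic0 + topic1, the inner join on id, and the threshold 0.5 are taken from the pipeline configuration.

```python
# Toy document-topic rows; ids and weights are made up for illustration.
doc_topic_dists = [
    {"id": 1, "topic0": 0.30, "topic1": 0.45, "topic2": 0.25},
    {"id": 2, "topic0": 0.01, "topic1": 0.02, "topic2": 0.97},
    {"id": 3, "topic0": 0.60, "topic1": 0.05, "topic2": 0.35},
    {"id": 4, "topic0": 0.40, "topic1": 0.40, "topic2": 0.20},
]

# Toy stand-in for the tokenized source dataset, keyed by document id.
# Id 4 is deliberately absent to show the effect of the inner join.
documents = {
    1: {"text": "growth outlook remains uncertain"},
    2: {"text": "new resort opens near the coast"},
    3: {"text": "recession risk weighs on recovery"},
}

THRESHOLD = 0.5  # matches the query "topic_relevant > 0.5"


def score_and_split(dists, docs, threshold=THRESHOLD):
    """Hypothetical helper: score, inner-join on id, and split by threshold."""
    train, discard = [], []
    for row in dists:
        doc = docs.get(row["id"])
        if doc is None:
            # Inner join: ids missing from the dataset are dropped entirely.
            continue
        merged = {
            "id": row["id"],
            "topic_relevant": row["topic0"] + row["topic1"],
            **doc,
        }
        if merged["topic_relevant"] > threshold:
            train.append(merged)
        else:
            discard.append(merged)
    return train, discard


train, discard = score_and_split(doc_topic_dists, documents)
print("retained ids: ", [d["id"] for d in train])
print("discarded ids:", [d["id"] for d in discard])
```

The retained rows play the role of the train.parquet output and the remaining rows that of discard.parquet. Keeping the discarded rows makes it possible to inspect what the threshold excludes before the topic model is rerun.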

This refinement targets documents that engage with the themes of risk, uncertainty, and strategic mitigation. Filtering before remodeling concentrates the analysis on the essential facets of uncertainty, which should improve both the precision and the interpretability of the findings. Because every step is defined in the pipeline configuration, the procedure is methodical and replicable.

!nbcpu +workflow=nbcpu tasks='[nbcpu-datasets_uncertainty_filter]' mode=__info__
[2023-08-15 19:16:38,472][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f27d820e610>
[2023-08-15 19:16:38,473][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 19:16:38,669][hyfi.main.main][INFO] - The HyFI config is not instantiatable, running HyFI task with the config
[2023-08-15 19:16:39,529][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7f27b8350760>
[2023-08-15 19:16:40,726][hyfi.workflow.workflow][INFO] - Running task [nbcpu-datasets_uncertainty_filter] with [run={} verbose=False uses='nbcpu-datasets_uncertainty_filter']
[2023-08-15 19:16:40,750][hyfi.task.task][INFO] - Running 1 pipeline(s)
[2023-08-15 19:16:40,750][hyfi.task.task][INFO] - Running pipeline: nbcpu-datasets_uncertainty_filter
[2023-08-15 19:16:40,762][hyfi.task.task][INFO] - Applying 6 pipes: [{'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataframes', 'data_files': 'nbcpu-topic_uncertainty/model/outputs/LDA_model(3)_k(20)-doc_topic_dists.parquet', 'data_dir': None, 'filetype': None, 'split': None, 'concatenate': False, 'ignore_index': False, 'use_cached': False, 'verbose': True}, {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns', 'expressions': {'topic_relevant': 'topic0 + topic1'}, 'engine': 'python', 'verbose': True}, {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_select_columns', 'columns': ['id', 'topic_relevant'], 'verbose': True}, {'_target_': 'hyfi.utils.datasets.combine.DSCombine.merge_dataframes', 'right': 'datasets/processed/khmer_tokenized/train.parquet', 'how': 'inner', 'on': 'id', 'left_on': None, 'right_on': None, 'left_index': False, 'right_index': False, 'sort': False, 'suffixes': ['_x', '_y'], 'copy': True, 'indicator': False, 'validate': None, 'verbose': True}, {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail', 'num_heads': 5, 'num_tails': 5, 'columns': ['id', 'text', 'tokens', 'topic_relevant'], 'verbose': True}, {'_target_': 'hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data', 'queries': ['topic_relevant > 0.5'], 'sample_size': None, 'sample_seed': 42, 'output_dir': 'datasets/processed/topic_uncertainty_filtered', 'sample_filename': None, 'train_filename': 'train.parquet', 'discard_filename': 'discard.parquet', 'returning_data': 'train', 'verbose': True}]
 Change directory to /home/yjlee/workspace/projects/nbcpu/workspace
[2023-08-15 19:16:40,767][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.load.DSLoad.load_dataframes with kwargs: {'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataframes', 'data_files': 'nbcpu-topic_uncertainty/model/outputs/LDA_model(3)_k(20)-doc_topic_dists.parquet', 'data_dir': None, 'filetype': None, 'split': None, 'concatenate': False, 'ignore_index': False, 'use_cached': False, 'verbose': True}
[2023-08-15 19:16:40,767][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.load.DSLoad.load_dataframes ...
[2023-08-15 19:16:40,770][hyfi.utils.iolibs][INFO] - Processing [1] files from ['nbcpu-topic_uncertainty/model/outputs/LDA_model(3)_k(20)-doc_topic_dists.parquet']
[2023-08-15 19:16:40,770][hyfi.utils.datasets.load][INFO] - Loading data from nbcpu-topic_uncertainty/model/outputs/LDA_model(3)_k(20)-doc_topic_dists.parquet
[2023-08-15 19:16:40,791][hyfi.utils.datasets.load][INFO] -  >> elapsed time to load data: 0:00:00.020824
[2023-08-15 19:16:40,793][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns with kwargs: {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns', 'expressions': {'topic_relevant': 'topic0 + topic1'}, 'engine': 'python', 'verbose': True}
[2023-08-15 19:16:40,793][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns ...
[2023-08-15 19:16:40,795][hyfi.utils.datasets.basic][INFO] - Evaluating column topic_relevant
[2023-08-15 19:16:40,801][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.basic.DSBasic.dataframe_select_columns with kwargs: {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_select_columns', 'columns': ['id', 'topic_relevant'], 'verbose': True}
[2023-08-15 19:16:40,802][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.basic.DSBasic.dataframe_select_columns ...
[2023-08-15 19:16:40,803][hyfi.utils.datasets.basic][INFO] - Selecting columns: ['id', 'topic_relevant']
[2023-08-15 19:16:40,808][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.combine.DSCombine.merge_dataframes with kwargs: {'_target_': 'hyfi.utils.datasets.combine.DSCombine.merge_dataframes', 'right': 'datasets/processed/khmer_tokenized/train.parquet', 'how': 'inner', 'on': 'id', 'left_on': None, 'right_on': None, 'left_index': False, 'right_index': False, 'sort': False, 'suffixes': ['_x', '_y'], 'copy': True, 'indicator': False, 'validate': None, 'verbose': True}
[2023-08-15 19:16:40,809][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.combine.DSCombine.merge_dataframes ...
[2023-08-15 19:16:40,812][hyfi.utils.datasets.combine][INFO] - Merging dataframes
[2023-08-15 19:16:44,247][hyfi.pipeline.config][INFO] - Running a pipe with hyfi.pipe.general_external_funcs
[2023-08-15 19:16:44,248][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail with kwargs: {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail', 'num_heads': 5, 'num_tails': 5, 'columns': ['id', 'text', 'tokens', 'topic_relevant'], 'verbose': True}
[2023-08-15 19:16:44,249][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail ...
[2023-08-15 19:16:44,251][hyfi.utils.datasets.basic][INFO] - Printing head and tail of dataframe
[2023-08-15 19:16:44,251][hyfi.utils.datasets.basic][INFO] - Head:
          id  ... topic_relevant
0  501330943  ...       0.341887
1  501326169  ...       0.001473
2  501325554  ...       0.000277
3  501322478  ...       0.000366
4  501321914  ...       0.999255

[5 rows x 4 columns]
[2023-08-15 19:16:44,260][hyfi.utils.datasets.basic][INFO] - Tail:
          id  ... topic_relevant
27605  48239  ...       0.000566
27606  48216  ...       0.000515
27607  48132  ...       0.000197
27608  48115  ...       0.572823
27609  48085  ...       0.000182

[5 rows x 4 columns]
[2023-08-15 19:16:44,272][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data with kwargs: {'_target_': 'hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data', 'queries': ['topic_relevant > 0.5'], 'sample_size': None, 'sample_seed': 42, 'output_dir': 'datasets/processed/topic_uncertainty_filtered', 'sample_filename': None, 'train_filename': 'train.parquet', 'discard_filename': 'discard.parquet', 'returning_data': 'train', 'verbose': True}
[2023-08-15 19:16:44,273][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data ...
[2023-08-15 19:16:44,288][hyfi.utils.datasets.slice][INFO] - filtering data by topic_relevant > 0.5
[2023-08-15 19:16:44,292][hyfi.utils.datasets.slice][INFO] - filtered 20693 documents
[2023-08-15 19:16:44,295][hyfi.utils.datasets.save][INFO] - Saving dataframe to /raid/cis/yjlee/workspace/projects/nbcpu/workspace/datasets/processed/topic_uncertainty_filtered/train.parquet
[2023-08-15 19:16:48,417][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:04.121659
           id  ...                                         predicates
4   501321914  ...  [average, supply, drop, significantly, may, lo...
5   501320777  ...  [see, significant, mobile, financial, introduc...
10  501272597  ...  [french, national, nbc, establish, gain, print...
11  501258270  ...  [french, national, nbc, establish, gain, print...
12  501092971  ...  [european, central, ecb, become, great, could,...

[5 rows x 12 columns]
[2023-08-15 19:16:48,438][hyfi.utils.datasets.save][INFO] - Saving dataframe to /raid/cis/yjlee/workspace/projects/nbcpu/workspace/datasets/processed/topic_uncertainty_filtered/discard.parquet
[2023-08-15 19:17:08,321][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:19.882527
          id  ...                                         predicates
0  501330943  ...  [national, ask, financial, consider, provide, ...
1  501326169  ...  [national, prepare, standardised, common, fina...
2  501325554  ...  [national, prepare, standardised, common, fina...
3  501322478  ...  [national, nbc, sign, chinese, international, ...
6  501319461  ...  [national, launch, share, enable, financial, m...

[5 rows x 12 columns]
[2023-08-15 19:17:08,338][hyfi.utils.datasets.slice][INFO] - Created 0 samples, 6917 train samples, and 20693 discard samples
 Change directory back to /raid/cis/yjlee/workspace/logs/hydra/nbcpu/2023-08-15/2023-08-15_19-16-37
[2023-08-15 19:17:08,554][hyfi.task.task][INFO] -  >> elapsed time for the task with 1 pipelines: 0:00:27.803055