Topic Modeling with Prior

In the second stage of the analysis, the research applies topic modeling with prior information to refine the topics relevant to central bank policy uncertainty in Cambodia’s highly dollarized economy. Seeding the model with prior knowledge steers it toward themes that match the research objectives, so that the resulting topics provide a usable basis for subsequent analysis and interpretation.

Prior Information

The prior information is set to guide the topic modeling towards specific themes relevant to central bank policy uncertainty. The prior consists of two main groups:

  • Group 0: Focuses on general economic indicators, including terms like ‘price’, ‘inflation’, ‘growth’, and ‘economy’.

  • Group 1: Concentrates on central banking aspects, with terms such as ‘nbc’, ‘central_bank’, ‘national_bank’, and ‘national_bank_cambodia’.
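One common way to encode seed groups like these is to give each seed word a larger Dirichlet weight in its target topic and a small weight in every other topic. The sketch below illustrates that idea only; it is not the thematos implementation or its configuration. The helper name `build_word_priors` and the seed weight of 1.0 are assumptions made for illustration, while the base weight of 0.01 mirrors the eta value reported in the log.

```python
# Seed groups copied from the prior described above.
SEED_GROUPS = {
    0: ["price", "inflation", "growth", "economy"],
    1: ["nbc", "central_bank", "national_bank", "national_bank_cambodia"],
}


def build_word_priors(seed_groups, num_topics, seed_weight=1.0, base_weight=0.01):
    """Return {word: [prior weight per topic]} for every seed word.

    Hypothetical helper: seed_weight is an assumed value, not one taken
    from the study. A seed word gets seed_weight in its own topic and
    base_weight in all other topics.
    """
    priors = {}
    for topic_id, words in seed_groups.items():
        if not 0 <= topic_id < num_topics:
            raise ValueError(f"topic id {topic_id} is outside 0..{num_topics - 1}")
        for word in words:
            priors[word] = [
                seed_weight if k == topic_id else base_weight
                for k in range(num_topics)
            ]
    return priors


word_priors = build_word_priors(SEED_GROUPS, num_topics=10)
print(word_priors["inflation"][:3])  # weights for topics 0, 1 and 2
```

A per-word weight vector of this shape is what a seeded LDA implementation would typically consume; the ratio between the seed weight and the base weight controls how strongly a word is pulled toward its topic.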


The configuration of the topic modeling with prior is as follows:

Running the Workflow

The topic modeling step of the workflow can be executed with the following command, which restricts the run to the nbcpu-topic_prior task:

!nbcpu +workflow=nbcpu tasks='[nbcpu-topic_prior]' mode=__info__
[2023-08-15 18:40:06,185][thematos.models.base][INFO] - Set words ['price', 'growth', 'inflation', 'economy'] to topic #0 as prior.
[2023-08-15 18:40:46,788][thematos.models.base][INFO] - Set words ['national_bank_cambodia', 'central_bank', 'nbc', 'national_bank'] to topic #1 as prior.
[2023-08-15 18:40:47,560][thematos.models.lda][INFO] - Number of docs: 27594
[2023-08-15 18:40:47,560][thematos.models.lda][INFO] - Vocab size: 18158
[2023-08-15 18:40:47,561][thematos.models.lda][INFO] - Number of words: 4810963
[2023-08-15 18:40:47,561][thematos.models.lda][INFO] - Removed top words: []
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.10728438 0.06657154 0.07608083 0.06004212 0.06778088 0.11082149
|   0.06340487 0.05877907 0.05908309 0.05704119]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
| #0 (742782) : price market growth inflation rate economy oil share dollar global stock low index investor trade high export bank debt trading
| #1 (619991) : bank digital customer loan payment banking nbc service financial business insurance company cambodia riel mobile transaction technology smes branch financial_institution
| #2 (403293) : energy india power climate world green climate_change emission technology coal solar global company way malaysia system data renewable_energy human research
| #3 (313718) : tourism tourist chinese airport city hotel visitor flight cambodia passenger province siem_reap angkor travel destination airline temple sihanoukville phnom_penh cambodian
| #4 (421016) : worker factory project export port garment road construction company cambodia investment logistics labour ministry union sector electricity japanese vehicle industry
| #5 (772618) : cambodia tax development asean cooperation economic trade policy government investment woman minister law agreement cambodian social meeting sector sustainable member
| #6 (401326) : election party political military myanmar democracy leader state human_right government opposition protest bangladesh former khmer_rouge right pakistan politics power india
| #7 (386550) : rice farmer water land project agriculture river food fish tonne property province area forest agricultural community crop village price fishery
| #8 (395547) : police court victim commune party drug cnrp prison cpp district law money case election judge provincial suspect crime phnom_penh complaint
| #9 (354122) : health covid-19 vaccine case virus child pandemic vaccination infection huawei disease medical omicron outbreak community patient hospital covid quarantine death

[2023-08-15 18:41:25,291][thematos.models.base][INFO] - ==== Coherence : u_mass ====
[2023-08-15 18:41:25,291][thematos.models.base][INFO] - Average: -1.6402675149726402
[2023-08-15 18:41:25,292][thematos.models.base][INFO] - Per Topic: [-1.425266578567007, -1.0729033256163611, -1.9737517568414404, -1.6950357738440491, -1.4372304993750558, -1.0894857078690348, -1.3641567969557398, -1.9651203444009737, -2.0280244511223575, -2.3516999151343816]
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - ==== Coherence : c_uci ====
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - Average: 0.7173903977549194
[2023-08-15 18:41:25,911][thematos.models.base][INFO] - Per Topic: [0.631636197256756, 1.0372400239549309, 1.0279241224565632, 1.1587594141923576, 0.4124652574389333, 0.3280867476153075, 0.9620709742174812, 0.8812465779074348, 0.9351462283826499, -0.20067156587322113]
[2023-08-15 18:41:26,612][thematos.models.base][INFO] - ==== Coherence : c_npmi ====
[2023-08-15 18:41:26,613][thematos.models.base][INFO] - Average: 0.10520817835902437
[2023-08-15 18:41:26,613][thematos.models.base][INFO] - Per Topic: [0.09169010689655911, 0.1513512764679383, 0.11547400424216366, 0.1390206534657676, 0.06185781111793033, 0.05339414794065087, 0.11865170803328073, 0.10947131563054402, 0.12918262344786416, 0.08198813634754491]
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - ==== Coherence : c_v ====
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - Average: 0.7093467557430267
[2023-08-15 18:41:30,242][thematos.models.base][INFO] - Per Topic: [0.7229848951101303, 0.8240539312362671, 0.6891917288303375, 0.7377653867006302, 0.5392432540655137, 0.5775677382946014, 0.7515179306268692, 0.7305361032485962, 0.787589567899704, 0.7330170214176178]
[2023-08-15 18:41:30,824][thematos.models.base][INFO] - ==== Document-Topic Distributions ====
[2023-08-15 18:41:30,825][thematos.models.base][INFO] -           id    topic0    topic1  ...    topic7    topic8    topic9
27589  48239  0.833621  0.000136  ...  0.000120  0.083921  0.000116
27590  48216  0.086417  0.000143  ...  0.133420  0.000127  0.000123
27591  48132  0.000117  0.000073  ...  0.000064  0.319019  0.000062
27592  48115  0.070176  0.000184  ...  0.000163  0.000164  0.000158
27593  48085  0.000108  0.169788  ...  0.000059  0.000060  0.000058

[2023-08-15 18:41:30,933][thematos.models.base][INFO] - ==== Topic-Word Distributions ====
[2023-08-15 18:41:30,938][thematos.models.base][INFO] -        cambodia  government  ...  rcep_cambodia-china_free  asia_pacific_region
5  7.159818e-03    0.003812  ...              4.491441e-09         4.491441e-09
6  6.598678e-09    0.002988  ...              6.515646e-09         6.515646e-09
7  1.485342e-03    0.001250  ...              7.418155e-09         7.418155e-09
8  2.527701e-04    0.000561  ...              6.831890e-09         6.832542e-09
9  2.897601e-03    0.001661  ...              8.522984e-09         4.052390e-05

Model Results

The Latent Dirichlet Allocation (LDA) model was fitted to a corpus of 27,594 documents containing 4,810,963 words, using a vocabulary of 18,158 of the 126,469 distinct terms. It was configured with 10 topics and trained for 100 iterations with no burn-in steps and an optimization interval of 10. The log reports an eta of 0.01 and per-topic alpha values between roughly 0.057 and 0.111.
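The corpus figures can be cross-checked against the log with a few lines of arithmetic. The figures in parentheses after each topic ID in the log sum exactly to the reported word count, so the sketch below treats them as the number of tokens assigned to each topic; that reading is an inference from the log, not something the source states. The variable names are illustrative.

```python
# Figures copied from the log output above.
NUM_DOCS = 27594
NUM_WORDS = 4810963
VOCAB_USED = 18158
VOCAB_TOTAL = 126469
TOPIC_TOKENS = [742782, 619991, 403293, 313718, 421016,
                772618, 401326, 386550, 395547, 354122]

# The per-topic counts account for every word in the corpus.
assert sum(TOPIC_TOKENS) == NUM_WORDS

avg_doc_len = NUM_WORDS / NUM_DOCS          # mean words per document
vocab_retained = VOCAB_USED / VOCAB_TOTAL   # fraction of the vocabulary kept
shares = [n / NUM_WORDS for n in TOPIC_TOKENS]
seeded_share = shares[0] + shares[1]        # topics #0 and #1 carry the priors
largest_topic = max(range(len(shares)), key=lambda k: shares[k])

print(f"average document length: {avg_doc_len:.1f} words")
print(f"vocabulary retained: {vocab_retained:.1%}")
print(f"share of words in seeded topics #0 and #1: {seeded_share:.1%}")
print(f"largest topic: #{largest_topic}")
```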

The resulting topics span markets, banking, energy and climate, tourism, manufacturing and labour, regional policy, politics, agriculture, crime and justice, and public health. The prior information steered the model towards themes related to central bank policy and economic indicators. Specifically:

  • Topic #0 aligns with the economic indicators prior: all four seed words (price, growth, inflation, economy) appear among its top terms, alongside market, rate, oil, and dollar.

  • Topic #1 captures the banking sector, including central banking. Of the four seed words, only nbc appears among its top 20 terms; the rest relate to commercial banking and financial services (loan, payment, digital, insurance, riel).

  • Topic #5 covers trade, investment, policy, and regional cooperation, including ASEAN, and relates indirectly to economic policy and central banking.

Together, these topics indicate that the prior steered the model towards the areas of interest, namely central banking and economic indicators. The coherence scores in the log support their interpretability: the average c_v is 0.709 and the average c_npmi is 0.105, with Topic #1 scoring highest on both measures (0.824 and 0.151).
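The per-topic coherence scores in the log can be summarized to see which topics are most and least coherent. The sketch below recomputes the averages for c_v and c_npmi and ranks the topics; the `summarize` helper is a hypothetical name introduced here, and the scores are copied from the log.

```python
from statistics import mean

# Per-topic coherence scores copied from the log output above.
C_V = [0.7229848951101303, 0.8240539312362671, 0.6891917288303375,
       0.7377653867006302, 0.5392432540655137, 0.5775677382946014,
       0.7515179306268692, 0.7305361032485962, 0.787589567899704,
       0.7330170214176178]
C_NPMI = [0.09169010689655911, 0.1513512764679383, 0.11547400424216366,
          0.1390206534657676, 0.06185781111793033, 0.05339414794065087,
          0.11865170803328073, 0.10947131563054402, 0.12918262344786416,
          0.08198813634754491]


def summarize(scores):
    """Return (average, best topic id, worst topic id) for per-topic scores."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k])
    return mean(scores), ranked[-1], ranked[0]


cv_avg, cv_best, cv_worst = summarize(C_V)
npmi_avg, npmi_best, npmi_worst = summarize(C_NPMI)

print(f"c_v:    average {cv_avg:.4f}, best #{cv_best}, worst #{cv_worst}")
print(f"c_npmi: average {npmi_avg:.4f}, best #{npmi_best}, worst #{npmi_worst}")
```

Both measures rank the seeded banking topic (#1) as the most coherent, while the broad policy and industry topics (#5 and #4) score lowest.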

Fig. 5 shows the wordcloud of the top 500 words in each topic from the LDA model with 10 topics and prior. The size of the word is proportional to the frequency of the word in the topic.


Fig. 5 Wordcloud of the top 500 words in each topic from the LDA model with 10 topics and prior. The size of the word is proportional to the frequency of the word in the topic.