Supplementary Materials#
The research code is publicly available as a Python package named nbcpu
, hosted on GitHub and published on PyPI. Built on the Hydra Fast Interface (HyFI) framework and integrated with plugins ThematOS and Lexikanon, nbcpu
offers a streamlined command-line interface for replicating the experiments described in this research.
Installation#
nbcpu
requires Python 3.8 or higher and can be installed using the following command:
pip install -U nbcpu
Usage#
Overview#
The nbcpu
package is designed to facilitate the entire workflow of the research, encompassing crawling, processing, topic modeling, and analysis. The interface is structured into four main parts, each corresponding to a specific phase of the research.
Configuration#
The configuration for nbcpu
is located in the src/nbcpu/conf
directory and is divided into several sub-configurations. The main configuration file, nbcpu
, orchestrates the entire workflow and includes the following sections:
Defaults: Specifies the default configurations for various tasks.
Tasks: Lists the tasks to be executed, such as fetching data, filtering datasets, and running topic models.
Global Settings: Defines global parameters like the number of workers and paths to datasets and workspace.
A snippet of the configuration file is provided below:
## @package _global_
defaults:
- __init__
- /fetcher@khmer_all: khmer_all
- /task@nbcpu-datasets: nbcpu-datasets
- /runner@nbcpu-topic_noprior: nbcpu-topic_noprior
- /task@nbcpu-datasets_noprior_filter: nbcpu-datasets_noprior_filter
- /runner@nbcpu-topic_prior: nbcpu-topic_prior
- /runner@nbcpu-topic_uncertainty: nbcpu-topic_uncertainty
- /task@nbcpu-datasets_uncertainty_filter: nbcpu-datasets_uncertainty_filter
- /runner@nbcpu-topic_uncertainty_filtered: nbcpu-topic_uncertainty_filtered
- override /project: nbcpu
_config_name_: nbcpu
verbose: false
tasks:
- khmer_all
- nbcpu-datasets
- nbcpu-topic_noprior
- nbcpu-datasets_noprior_filter
- nbcpu-topic_prior
- nbcpu-topic_uncertainty
- nbcpu-datasets_uncertainty_filter
- nbcpu-topic_uncertainty_filtered
nbcpu-topic_uncertainty_filtered:
calls:
- train
- infer
infer_args:
model_config_file: ${oc.select:variables.project_root_path,.}/workspace/nbcpu-topic_uncertainty_filtered/model/configs/model(2)_config.yaml
corpus_to_infer:
text_col: tokens
data_load:
data_file: ${oc.select:variables.project_root_path,.}/workspace/datasets/processed/topic_noprior_filtered/train.parquet
global:
num_workers: 100
datasets_path: ${__get_path__:datasets}
workspace_path: ${__project_workspace_path__:}
Running Experiments#
The entire workflow can be executed with the following command:
nbcpu +workflow=nbcpu
This command triggers the sequence of tasks defined in the configuration, including data fetching, preprocessing, topic modeling, and uncertainty analysis.
Detailed Workflow#
Crawling: Fetches the required data based on the specified configuration.
Processing: Includes preprocessing tasks such as filtering datasets based on specific criteria.
Topic Modeling: Runs different topic modeling tasks, including those with and without prior information.
Analysis: Conducts analysis on the derived topics, including uncertainty filtering and inference.
The package also contains the code for the experiments, serving as a reference for implementing similar research.