# Processing the Data

Processing the data crawled from Khmer Times involves four steps: normalization, lemmatization, tokenization, and extraction of tokens by part of speech. This section walks through the whole process, including the tokenizer configuration and the tokenization workflow.


## 1. Normalization and Lemmatization

Before tokenization, the raw data is normalized and lemmatized to ensure consistency and reduce complexity. The normalization process includes fixing inconsistent UTF-8 encoding, character width, line breaks, and other text-related issues. Lemmatization reduces words to their base or root form, ensuring that different forms of a word are treated as a single item.
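
To make these two steps concrete, here is a minimal sketch in plain Python. It is not the normalizer or lemmatizer that `nbcpu` uses: the function names and the tiny `LEMMAS` lookup table are hypothetical and exist only for illustration. A real pipeline would rely on a proper lemmatizer rather than a hand-written table.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply a few common normalization fixes (illustrative only)."""
    # NFKC folds full-width and other compatibility characters to canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line breaks to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n")).strip()


# Hypothetical lemma table; a real lemmatizer would replace this lookup.
LEMMAS = {"markets": "market", "rose": "rise", "prices": "price"}


def lemmatize(token: str) -> str:
    """Map a token to its base form, falling back to the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())


raw = "\uff2darkets  rose\r\nPrices\tfell "  # starts with a full-width "M"
clean = normalize_text(raw)
print(repr(clean))                            # 'Markets rose\nPrices fell'
print([lemmatize(t) for t in clean.split()])  # ['market', 'rise', 'price', 'fell']
```

Normalizing first means the lemma lookup sees one consistent spelling of each word, which is why normalization comes before lemmatization here.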

## 2. Tokenization

Tokenization breaks the text down into individual tokens or words. The tokenizer configuration groups its parameters into four sets of settings:

- **Tokenizer Settings:** Utilizes the NLTK tokenizer, with options to include whitespace tokens, convert to lowercase, and specify part-of-speech tags for nouns, verbs, adjectives, adverbs, etc.
- **Normalizer Settings:** Includes options to fix various text issues, such as inconsistent encoding, character width, line breaks, and special characters.
- **Stopwords Settings:** Defines the rules for removing common words that may not contribute to the analysis, such as pronouns, conjunctions, determiners, etc.
- **Tagger Settings:** Utilizes the NLTK tagger for part-of-speech tagging, with options for lemmatization and stemming.

The detailed configuration for tokenization is shown below:

```yaml
defaults:
  - nltk_universal

stopwords:
  stopwords_fn: "lambda x: len(x) <= 2"
  # stopwords_list:
  nltk_stopwords_lang: english
  stopwords_path: ${__project_root_path__:}/tests/assets/stopwords/nbcpu-tokenizer.txt
  verbose: false
lowercase: true
include_whitespace_token: false
strip_pos: false
tagger:
  stem: false
  lemmatize: true
verbose: false
```
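
The stopword settings above combine a predicate (`stopwords_fn`) with a stopword list. The sketch below shows one way such a rule could be applied, assuming a token is dropped when either the list or the predicate matches it. The `remove_stopwords` helper and the three-word list are hypothetical, and how `nbcpu` evaluates the `stopwords_fn` string internally is not shown in this section.

```python
from typing import Callable, Iterable, List, Set


def remove_stopwords(
    tokens: Iterable[str],
    stopword_list: Set[str],
    stopwords_fn: Callable[[str], bool],
    lowercase: bool = True,
) -> List[str]:
    """Drop tokens that are in the list or match the predicate."""
    kept = []
    for token in tokens:
        word = token.lower() if lowercase else token
        if word in stopword_list or stopwords_fn(word):
            continue
        kept.append(word)
    return kept


# Same predicate as in the configuration: tokens of one or two characters.
stopwords_fn = lambda x: len(x) <= 2
# Hypothetical stand-in for the English stopword list and the stopwords file.
stopword_list = {"the", "and", "was"}

tokens = ["The", "PM", "was", "in", "Phnom", "Penh", "and", "spoke"]
print(remove_stopwords(tokens, stopword_list, stopwords_fn))
# ['phnom', 'penh', 'spoke']
```

Note that a length-based predicate also removes short abbreviations such as "PM", which may or may not be desirable for a given corpus.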

To inspect the default configuration, run `nbcpu +tokenizer=nbcpu` with the `dryrun=true` option, which prints the configuration without running the tokenizer.

```bash
nbcpu +tokenizer=nbcpu dryrun=true
```


## 3. Tokenization Workflow

The tokenization workflow is defined in the `src/nbcpu/conf/pipeline` directory and consists of several steps:

- **Loading Dataset:** Loads the dataset from the specified path in the desired format (e.g., parquet).
- **Tokenizing Dataset:** Applies the specified tokenizer to the text column, creating a new tokenized column.
- **Extracting Tokens:** Extracts tokens by part-of-speech tag into separate columns: all tokens (`tokens`), adjectives and nouns (`adjnouns`) for Topic Classification Model 1, and verbs, particles, adjectives, and adverbs (`predicates`) for Model 2.
- **Converting to Pandas DataFrame:** Transforms the dataset into a Pandas DataFrame for further processing.
- **Evaluating and Filtering Data:** Applies expressions to evaluate specific columns, filters data based on queries, and samples the data if needed.
- **Printing Head and Tail:** Prints the beginning and end of the DataFrame for verification.
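
The extraction step can be sketched as follows. This is an illustration, not the `nbcpu` implementation: it assumes each tokenized item is a `word/TAG` string with a universal part-of-speech tag, and the `extract_tokens` function and its `delimiter` parameter are hypothetical names chosen to mirror the workflow options `postags` and `strip_pos`.

```python
from typing import List, Optional, Sequence


def extract_tokens(
    tokens: Sequence[str],
    postags: Optional[Sequence[str]] = None,
    strip_pos: bool = True,
    delimiter: str = "/",
) -> List[str]:
    """Keep tokens whose tag is in `postags`; keep all tokens if it is None."""
    extracted = []
    for token in tokens:
        word, sep, tag = token.rpartition(delimiter)
        if not sep:
            # No delimiter found: treat the whole token as an untagged word.
            word, tag = token, ""
        if postags is not None and tag not in postags:
            continue
        extracted.append(word if strip_pos else token)
    return extracted


# Assumed token format: "word/TAG".
tokenized = ["new/ADJ", "market/NOUN", "open/VERB", "quickly/ADV", "in/ADP", "up/PRT"]

print(extract_tokens(tokenized))                                  # all words
print(extract_tokens(tokenized, ["ADJ", "NOUN"]))                 # ['new', 'market']
print(extract_tokens(tokenized, ["VERB", "PRT", "ADJ", "ADV"]))   # ['new', 'open', 'quickly', 'up']
```

Keeping the tagged column and deriving each filtered column from it means the text only has to be tokenized and tagged once, however many tag subsets are needed.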

The detailed YAML configuration for the tokenization workflow is shown below:

```yaml
defaults:
  - nbcpu-datasets

steps:
  - uses: load_dataset
    with:
      data_files: datasets/raw/khmer_articles.parquet
      path: parquet
      split: train
    verbose: true
  - uses: tokenize_dataset
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      text_col: text
      token_col: tokenized
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: tokens
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: adjnouns
      postags:
        - ADJ
        - NOUN
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: predicates
      postags:
        - VERB
        - PRT
        - ADJ
        - ADV
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: dataset_to_pandas
    verbose: true
  - uses: dataframe_eval_columns
    with:
      expressions:
        id: "url.str.split('/').str[3]"
      verbose: true
  - uses: filter_and_sample_data
    with:
      queries:
        - "tokens.str.len() > 50"
      output_dir: datasets/processed/khmer_tokenized
      train_filename: train.parquet
      discard_filename: discard.parquet
      verbose: true
  - uses: dataframe_print_head_and_tail
    with:
      columns: [id, text, tokens, adjnouns, predicates]
      verbose: true
    verbose: true
```
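
The `id` expression and the filter query in the workflow can be read as follows. The sketch uses plain Python rather than pandas, the example URLs are made up, and the assumption that rows failing the query go to the discard file is inferred only from the `discard_filename` option. In pandas, `.str.len()` returns the character count for string values and the item count for list values, so the effective threshold depends on how the `tokens` column is stored.

```python
from typing import Dict, List, Tuple

MIN_TOKEN_LENGTH = 50  # threshold from the query "tokens.str.len() > 50"


def article_id(url: str) -> str:
    """Mirror url.str.split('/').str[3]: the first path segment of the URL."""
    return url.split("/")[3]


def filter_rows(rows: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Split rows into (train, discard) by the length of the tokens field."""
    train, discard = [], []
    for row in rows:
        row = {**row, "id": article_id(row["url"])}
        if len(row["tokens"]) > MIN_TOKEN_LENGTH:
            train.append(row)
        else:
            discard.append(row)
    return train, discard


# Hypothetical rows; the URL layout is an assumption for illustration.
rows = [
    {"url": "https://example.com/1001/first-story/", "tokens": "word " * 20},
    {"url": "https://example.com/1002/second-story/", "tokens": "short"},
]

train, discard = filter_rows(rows)
print([r["id"] for r in train])    # ['1001']
print([r["id"] for r in discard])  # ['1002']
```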


## 4. Execution

To run the tokenization and processing workflow, use the following command:

```bash
nbcpu +workflow=nbcpu tasks='[nbcpu-datasets]' mode=__info__
```

This command runs the entire processing workflow as defined in the configuration: normalization, lemmatization, tokenization, and extraction of specific parts of speech. The `mode=__info__` setting enables informational log output, so you can monitor progress and verify each step.

Running this command transforms the raw data crawled from Khmer Times into a structured, consistent format that is ready for topic modeling and further analysis.


[2023-08-15 16:02:15,053][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7fae180975e0>
[2023-08-15 16:02:15,053][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 16:02:15,245][hyfi.main.main][INFO] - The HyFI config is not instantiatable, running HyFI task with the config
[2023-08-15 16:02:16,077][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7fae18032eb0>
[2023-08-15 16:02:17,318][hyfi.workflow.workflow][INFO] - Running task [nbcpu-datasets] with [run={} verbose=False uses='nbcpu-datasets']
[2023-08-15 16:02:17,346][hyfi.task.task][INFO] - Running 1 pipeline(s)
[2023-08-15 16:02:17,346][hyfi.task.task][INFO] - Running pipeline: nbcpu-datasets_tokeni