Processing the Data#

Processing the articles crawled from Khmer Times involves four steps: normalization, lemmatization, tokenization, and extraction of tokens by part of speech. This section walks through the whole process and details the tokenizer configuration and the tokenization workflow.

1. Normalization and Lemmatization#

Before tokenization, the raw text is normalized to ensure consistency and reduce complexity. Normalization fixes inconsistent UTF-8 encoding, character width, line breaks, and other text-level issues. Lemmatization reduces each word to its base form (for example, "holds" becomes "hold"), so that different inflections of a word are treated as a single item; in the configuration shown in the next section it is switched on through the tagger's lemmatize option.
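
The normalization fixes listed above can be illustrated with a minimal sketch that uses only the Python standard library. The function name normalize_text and the particular rules it applies are assumptions made for illustration; this is not the normalizer that nbcpu actually ships. Lemmatization is left out because it needs a lexical resource such as WordNet.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative normalizer (hypothetical, not the nbcpu implementation)."""
    # NFKC folds full-width and other compatibility characters into canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line breaks to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n"))


# The first word is written with full-width letters on purpose.
sample = "Ｔｈａｉｃｏｍ  holds a 75% stake.\r\nCAT holds 25%. "
print(normalize_text(sample))
```

Applying Unicode normalization first means that the later whitespace rules only have to deal with one representation of each character.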

2. Tokenization#

Tokenization is the process of breaking down the text into individual tokens or words. The configuration for tokenization includes various parameters to control the process:

Tokenizer Settings: Uses the NLTK tokenizer, with options to include whitespace tokens (include_whitespace_token), convert text to lowercase (lowercase), and specify the part-of-speech tags for nouns, verbs, adjectives, adverbs, and so on.
Normalizer Settings: Options to fix text issues such as inconsistent encoding, character width, line breaks, and special characters.
Stopwords Settings: Rules for removing common words that contribute little to the analysis, such as pronouns, conjunctions, and determiners. In the configuration below, these combine a length rule (stopwords_fn, which matches tokens of two characters or fewer), the NLTK English stopword list, and a project-specific stopword file.
Tagger Settings: Uses the NLTK tagger for part-of-speech tagging, with switches for lemmatization and stemming (lemmatization is enabled and stemming is disabled below).

The detailed configuration for tokenization is shown below:

defaults:
  - nltk_universal

stopwords:
  stopwords_fn: "lambda x: len(x) <= 2"
  # stopwords_list:
  nltk_stopwords_lang: english
  stopwords_path: ${__project_root_path__:}/tests/assets/stopwords/nbcpu-tokenizer.txt
  verbose: false
lowercase: true
include_whitespace_token: false
strip_pos: false
tagger:
  stem: false
  lemmatize: true
verbose: false
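
To make the stopword settings more concrete, the following sketch combines a length rule equivalent to the stopwords_fn expression with a small word list. The helper names and the four-word list are hypothetical stand-ins; how nbcpu itself loads and applies these settings is not shown here.

```python
def is_short(token: str) -> bool:
    # Same predicate as the stopwords_fn string in the configuration above.
    return len(token) <= 2


# Tiny stand-in for a stopword list; the real lists are much longer.
STOPWORD_LIST = {"the", "and", "with", "for"}


def remove_stopwords(tokens):
    """Drop tokens matched by the length rule or found in the word list."""
    return [t for t in tokens if not is_short(t) and t not in STOPWORD_LIST]


tokens = ["the", "jv", "with", "registered", "capital", "of", "10", "million", "baht"]
print(remove_stopwords(tokens))
```

Under these assumptions, short tokens such as "jv", "of", and "10" are removed by the length rule alone, without having to appear in any list.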

To inspect the default configuration, run the nbcpu +tokenizer=nbcpu command with the dryrun=true option, which prints the configuration without running the tokenizer:

nbcpu +tokenizer=nbcpu dryrun=true

3. Tokenization Workflow#

The tokenization workflow is defined in the src/nbcpu/conf/pipeline directory and consists of several steps:

Loading Dataset: Loads the dataset from the specified path in the desired format (e.g., parquet).
Tokenizing Dataset: Applies the specified tokenizer to the text column, creating a new tokenized column.
Extracting Tokens: Extracts tokens by part-of-speech tag into separate columns: tokens (no tag filter), adjnouns (adjectives and nouns, for Topic Classification Model 1), and predicates (verbs, particles, adjectives, and adverbs, the additional tokens for Model 2).
Converting to Pandas DataFrame: Transforms the dataset into a Pandas DataFrame for further processing.
Evaluating and Filtering Data: Applies expressions to evaluate specific columns, filters data based on queries, and samples the data if needed.
Printing Head and Tail: Prints the beginning and end of the DataFrame for verification.
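
The extraction step can be pictured with a short sketch. It assumes the tokenized column holds strings of the form word/TAG, as in the sample output later in this section. The function below is a hypothetical illustration that filters by tag only; it is not the lexikanon implementation, which also applies the stopword rules.

```python
from typing import Iterable, List, Optional


def extract_tokens(
    tagged: Iterable[str],
    postags: Optional[List[str]] = None,
    strip_pos: bool = True,
    delim: str = "/",
) -> List[str]:
    """Keep tokens whose tag is in postags (all tokens if postags is None)."""
    keep = set(postags) if postags else None
    result = []
    for item in tagged:
        # rpartition splits on the last delimiter, so a word that itself
        # contains "/" keeps its text intact.
        word, _, tag = item.rpartition(delim)
        if keep is not None and tag not in keep:
            continue
        result.append(word if strip_pos else item)
    return result


tagged = ["joint/ADJ", "venture/NOUN", "it/PRON", "recently/ADV",
          "form/VERB", "with/ADP", "cat/NOUN", "telecom/NOUN"]
adjnouns = extract_tokens(tagged, ["ADJ", "NOUN"])
predicates = extract_tokens(tagged, ["VERB", "PRT", "ADJ", "ADV"])
print(adjnouns)
print(predicates)
```

Keeping the tagged column and deriving each filtered column from it means the comparatively expensive tagging pass runs once, however many tag sets are extracted.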

The detailed YAML configuration for the tokenization workflow is shown below:

defaults:
  - nbcpu-datasets

steps:
  - uses: load_dataset
    with:
      data_files: datasets/raw/khmer_articles.parquet
      path: parquet
      split: train
    verbose: true
  - uses: tokenize_dataset
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      text_col: text
      token_col: tokenized
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: tokens
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: adjnouns
      postags:
        - ADJ
        - NOUN
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: extract_tokens
    with:
      tokenizer: nbcpu
      num_workers: ${oc.select:variables.num_workers,1}
      token_col: tokenized
      extracted_col: predicates
      postags:
        - VERB
        - PRT
        - ADJ
        - ADV
      strip_pos: true
      load_from_cache_file: false
      verbose: true
  - uses: dataset_to_pandas
    verbose: true
  - uses: dataframe_eval_columns
    with:
      expressions:
        id: "url.str.split('/').str[3]"
      verbose: true
  - uses: filter_and_sample_data
    with:
      queries:
        - "tokens.str.len() > 50"
      output_dir: datasets/processed/khmer_tokenized
      train_filename: train.parquet
      discard_filename: discard.parquet
      verbose: true
  - uses: dataframe_print_head_and_tail
    with:
      columns: [id, text, tokens, adjnouns, predicates]
      verbose: true
    verbose: true
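
The two DataFrame steps near the end of the workflow can be traced by hand. The sketch below mirrors the expressions url.str.split('/').str[3] and tokens.str.len() > 50 in plain Python rather than pandas. The two URLs come from the sample output of the run; the token lists are placeholders, and treating the discard set as the rows that fail the query is an assumption.

```python
rows = [
    {
        "url": "https://www.khmertimeskh.com/743741/thaicom-cat-tie-up-for-low-earth-orbit-satellite-services-for-cambodia-and-other-countries/",
        "tokens": ["token"] * 60,  # placeholder: 60 tokens
    },
    {
        "url": "https://www.khmertimeskh.com/50657009/mcdonalds-ceo-forced-out-over-consensual-relationship-with-employee/",
        "tokens": ["token"] * 12,  # placeholder: 12 tokens
    },
]

# Equivalent of id: "url.str.split('/').str[3]".
# Splitting on "/" yields ["https:", "", "www.khmertimeskh.com", "<id>", ...].
for row in rows:
    row["id"] = row["url"].split("/")[3]

# Equivalent of the query "tokens.str.len() > 50" on a list-valued column.
train = [row for row in rows if len(row["tokens"]) > 50]
discard = [row for row in rows if len(row["tokens"]) <= 50]

print([row["id"] for row in train])
print([row["id"] for row in discard])
```

The fourth element of the split is the numeric article identifier because the scheme and the empty string after "https:" occupy the first two positions.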

Execution#

To run the tokenization and processing workflow, use the following command:

!nbcpu +workflow=nbcpu tasks='[nbcpu-datasets]' mode=__info__

This command initiates the entire processing workflow as defined in the configuration, including normalization, lemmatization, tokenization, and extraction of specific parts of speech. The mode=__info__ part of the command provides informational output, allowing you to monitor the progress and verify the process.

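Executing this command transforms the raw data crawled from Khmer Times into a structured and consistent format, ready for topic modeling and further analysis. The logged run that follows also applied a sample_dataset step (sample_size 1000) and wrote to datasets/processed/khmer_tokenized_sample, so it reports 10 pipes rather than the nine steps listed in the configuration above.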

!nbcpu +workflow=nbcpu tasks='[nbcpu-datasets]' mode=__info__

[2023-08-15 16:02:15,053][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7fae180975e0>
[2023-08-15 16:02:15,053][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 16:02:15,245][hyfi.main.main][INFO] - The HyFI config is not instantiatable, running HyFI task with the config
[2023-08-15 16:02:16,077][hyfi.joblib.joblib][INFO] - initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7fae18032eb0>
[2023-08-15 16:02:17,318][hyfi.workflow.workflow][INFO] - Running task [nbcpu-datasets] with [run={} verbose=False uses='nbcpu-datasets']
[2023-08-15 16:02:17,346][hyfi.task.task][INFO] - Running 1 pipeline(s)
[2023-08-15 16:02:17,346][hyfi.task.task][INFO] - Running pipeline: nbcpu-datasets_tokenize
[2023-08-15 16:02:17,368][hyfi.task.task][INFO] - Applying 10 pipes: [{'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataset', 'path': 'parquet', 'name': None, 'data_dir': None, 'data_files': 'datasets/raw/khmer_articles.parquet', 'split': 'train', 'cache_dir': None, 'features': None, 'download_config': None, 'download_mode': None, 'verification_mode': None, 'num_proc': None}, {'_target_': 'hyfi.utils.datasets.slice.DSSlice.sample_dataset', 'split': None, 'sample_size': 1000, 'sample_seed': 42, 'randomize': True, 'num_heads': 1, 'num_tails': 1, 'verbose': True}, {'_target_': 'lexikanon.pipes.tokenize.tokenize_dataset', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'text_col': 'text', 'token_col': 'tokenized', 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}, {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'tokens', 'nouns_only': False, 'postags': None, 'stop_postags': None, 'strip_pos': True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}, {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'adjnouns', 'nouns_only': False, 'postags': ['ADJ', 'NOUN'], 'stop_postags': None, 'strip_pos': True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}, {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'predicates', 'nouns_only': False, 'postags': ['VERB', 'PRT', 'ADJ', 'ADV'], 'stop_postags': None, 'strip_pos': 
True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}, {'_target_': 'to_pandas'}, {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns', 'expressions': {'id': "url.str.split('/').str[3]"}, 'engine': 'python', 'verbose': True}, {'_target_': 'hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data', 'queries': ['tokens.str.len() > 50'], 'sample_size': None, 'sample_seed': 42, 'output_dir': 'datasets/processed/khmer_tokenized_sample', 'sample_filename': None, 'train_filename': 'train.parquet', 'discard_filename': 'discard.parquet', 'returning_data': 'train', 'verbose': True}, {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail', 'num_heads': 5, 'num_tails': 5, 'columns': ['id', 'text', 'tokens', 'adjnouns', 'predicates'], 'verbose': True}]
 Change directory to /home/yjlee/workspace/projects/nbcpu/workspace
[2023-08-15 16:02:17,369][hyfi.pipeline.config][INFO] - Running a pipe with hyfi.pipe.general_external_funcs
[2023-08-15 16:02:17,371][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.load.DSLoad.load_dataset with kwargs: {'_target_': 'hyfi.utils.datasets.load.DSLoad.load_dataset', 'path': 'parquet', 'name': None, 'data_dir': None, 'data_files': 'datasets/raw/khmer_articles.parquet', 'split': 'train', 'cache_dir': None, 'features': None, 'download_config': None, 'download_mode': None, 'verification_mode': None, 'num_proc': None}
[2023-08-15 16:02:18,300][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.slice.DSSlice.sample_dataset with kwargs: {'_target_': 'hyfi.utils.datasets.slice.DSSlice.sample_dataset', 'split': None, 'sample_size': 1000, 'sample_seed': 42, 'randomize': True, 'num_heads': 1, 'num_tails': 1, 'verbose': True}
[2023-08-15 16:02:18,301][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.slice.DSSlice.sample_dataset ...
[2023-08-15 16:02:18,310][hyfi.utils.datasets.slice][INFO] - Sampling done.
{'url': ['https://www.khmertimeskh.com/743741/thaicom-cat-tie-up-for-low-earth-orbit-satellite-services-for-cambodia-and-other-countries/'], 'keyword': ['Exchange+Rate'], 'title': ['Thaicom-CAT tie-up for low earth orbit satellite services for Cambodia and other countries'], 'categories': [['Business']], 'time': [datetime.datetime(2020, 7, 11, 1, 20, 57)], 'text': ['SET-listed Thaicom says a joint venture it recently formed with CAT Telecom would serve as a hub for low earth orbit (LEO) satellite service in the Cambodia, Laos, Myanmar and Vietnam (CLMV) market, catering to demand for high-speed internet via 5G tech and innovative applications.\nThe establishment of the joint venture, called Nation Space and Technology Co, was reported by Thaicom to the Stock Exchange of Thailand late last month.\nIn the JV with registered capital of 10 million baht, Thaicom holds a 75% stake, while CAT holds 25%.\nThe JV is aimed at providing satellite gateway services and solutions as well as marketing the sale of LEO satellites.\nThaicom chief executive Anant Kaewruamvongs said LEO satellites operate between 500 and 2,000 kilometres above the earth’s surface, compared to 36,000km for geostationary satellites, a traditional type of communication satellite.\nHe said the advantage of LEO satellites is a low latency signal, which could benefit people with access to high-speed internet services via 5G tech as well as the usage of Internet of Things devices, machine-to-machine tech, drones and applications that require high levels of accuracy, such as remote surgery.\nLEO satellite projects are operated by two main companies: Space Exploration Technologies Corp (SpaceX) and London-based satellite internet access provider OneWeb.\nLEO satellites have been launched and commercial service is expected from next year.\nMr Anant said related LEO businesses are going to be the focus of the JV.\n“Thaicom’s strength lies in global communication and network system management as well as marketing 
and sales channels, while CAT Telecom’s strengths are gateway service, submarine fibre and telecom networks,” he said.\nThe partnership should enhance the business strategy of the two companies, including providing engineering network management, gateway service and a marketing arm, said Mr Anant. “Through the JV, we can form the strongest partnership in Asean for LEO commercial services, especially in the CLMV market,” he said.\nCAT was recently approved by the National Digital Economy and Society Committee as the sole agency handling the operations and assets of Thaicom’s satellite service concession, which is due to end in September next year.\nCAT expressed its readiness to take control of satellites Thaicom 4 and 6 after the concession expires.\nThe state telecom enterprise indicated it will assign 24 staff to attend a training course in collaboration with Thaicom for one year, starting this September, to ensure seamless continuity of the two satellites’ operation. Bangkok Post']}
{'url': ['https://www.khmertimeskh.com/50657009/mcdonalds-ceo-forced-out-over-consensual-relationship-with-employee/'], 'keyword': ['Financial'], 'title': ['McDonald’s CEO forced out over ‘consensual relationship’ with employee'], 'categories': [['Business']], 'time': [datetime.datetime(2019, 11, 4, 18, 3, 39)], 'text': ['(AFP) – McDonald’s announced Sunday that its president and CEO Steve Easterbrook was forced out after showing “poor judgment” by engaging in a “consensual relationship” with an employee.\nHe was replaced by Chris Kempczinski, the president of McDonald’s USA. Mr Kempczinski was also elected to the board of directors.\n“Easterbrook… has separated from the company following the board’s determination that he violated company policy and demonstrated poor judgment involving a recent consensual relationship with an employee,” the company said in a statement.\n“The company confirms that this leadership transition is unrelated to the company’s operational or financial performance.”\nIn an email to McDonald’s employees, Easterbrook said his relationship was “a mistake” that violated company policy.\n“Given the values of the company, I agree with the board that it is time for me to move on,” the email said.\nJoe Erlinger, president of international operated markets, will take over as head of McDonald’s USA, the company said.\nIn its most recent earnings report, on October 22, McDonald’s said profits dipped 1.8 percent in the third quarter from the year-ago period to $1.6 billion.\nRevenues at the company, which has 38,000 restaurants in more than 100 countries, edged up 1.1 percent to $5.4 billion.\nThe fast-food giant notched a healthy 5.9 percent increase in global comparable sales, including a solid rise in the United States.\nBut profits were pressured by increased spending on technology and research and development.\nMcDonald’s has invested heavily in home delivery and mobile pay initiatives in recent years, and in 2019 has unveiled a number of 
acquisitions to boost its drive-through operation.\nMr Kempczinski told The Wall Street Journal on Sunday that he plans to continue the focus on technology.\n“There isn’t going to be some radical, strategic shift. The plan is working,” he said.\nMr Easterbrook had served as chief executive since 2015. Under his leadership, McDonald’s share price doubled, but he was unable to stop a decline in sales.\nLike other fast food chains, McDonald’s is facing headwinds as consumers seek out healthier dining options.\nMr Easterbrook’s pay as CEO rose with McDonald’s share price, which closed last week at $194. His compensation hit a peak in 2017 at $21.8 million, including $9.1 million in incentive-based pay, the Journal said.\nWorkplace relationships have cost a number of CEOs their jobs in recent years, and the topic has become even more sensitive amid the #MeToo movement.\nIntel CEO Brian Krzanich and yoga apparel brand Lululemon chief Laurent Potdevin resigned from their companies in 2018 following revelations of relationships with employees.\nIn 2016, Priceline CEO Darren Huston stepped down for the same reason, as did BestBuy CEO Brian Dunn in 2012.']}
[2023-08-15 16:02:18,313][hyfi.pipeline.config][INFO] - Returning partial function: lexikanon.pipes.tokenize.tokenize_dataset with kwargs: {'_target_': 'lexikanon.pipes.tokenize.tokenize_dataset', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'text_col': 'text', 'token_col': 'tokenized', 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}
[2023-08-15 16:02:18,314][hyfi.composer.composer][INFO] - instantiating lexikanon.pipes.tokenize.tokenize_dataset ...
Map (num_proc=100): 100%|███████████| 1000/1000 [00:05<00:00, 187.20 examples/s]
[2023-08-15 16:02:24,403][lexikanon.pipes.tokenize][INFO] - POS tagging done. See column 'tokenized'.
[['set-listed/ADJ', 'thaicom/NOUN', 'say/VERB', 'a/DET', 'joint/ADJ', 'venture/NOUN', 'it/PRON', 'recently/ADV', 'form/VERB', 'with/ADP', 'cat/NOUN', 'telecom/NOUN', 'would/VERB', 'serve/VERB', 'a/ADP', 'a/DET', 'hub/NOUN', 'for/ADP', 'low/ADJ', 'earth/NOUN', 'orbit/NOUN', '(/.', 'leo/NOUN', ')/.', 'satellite/NOUN', 'service/NOUN', 'in/ADP', 'the/DET', 'cambodia/NOUN', ',/.', 'lao/NOUN', ',/.', 'myanmar/NOUN', 'and/CONJ', 'vietnam/NOUN', '(/.', 'clmv/NOUN', ')/.', 'market/NOUN', ',/.', 'cater/VERB', 'to/PRT', 'demand/VERB', 'for/ADP', 'high-speed/ADJ', 'internet/NOUN', 'via/ADP', '5g/NUM', 'tech/NOUN', 'and/CONJ', 'innovative/ADJ', 'application/NOUN', './.', 'the/DET', 'establishment/NOUN', 'of/ADP', 'the/DET', 'joint/ADJ', 'venture/NOUN', ',/.', 'call/VERB', 'nation/NOUN', 'space/NOUN', 'and/CONJ', 'technology/NOUN', 'co/NOUN', ',/.', 'be/VERB', 'report/VERB', 'by/ADP', 'thaicom/NOUN', 'to/PRT', 'the/DET', 'stock/NOUN', 'exchange/NOUN', 'of/ADP', 'thailand/NOUN', 'late/ADJ', 'last/ADJ', 'month/NOUN', './.', 'in/ADP', 'the/DET', 'jv/NOUN', 'with/ADP', 'registered/ADJ', 'capital/NOUN', 'of/ADP', '10/NUM', 'million/NUM', 'baht/NOUN', ',/.', 'thaicom/NOUN', 'hold/VERB', 'a/DET', '75/NUM', '%/NOUN', 'stake/NOUN', ',/.', 'while/ADP', 'cat/NOUN', 'hold/VERB', '25/NUM', '%/NOUN', './.', 'the/DET', 'jv/NOUN', 'be/VERB', 'aim/VERB', 'at/ADP', 'provide/VERB', 'satellite/ADJ', 'gateway/NOUN', 'service/NOUN', 'and/CONJ', 'solution/NOUN', 'as/ADV', 'well/ADV', 'a/ADP', 'market/VERB', 'the/DET', 'sale/NOUN', 'of/ADP', 'leo/ADJ', 'satellite/NOUN', './.', 'thaicom/NOUN', 'chief/ADJ', 'executive/NOUN', 'anant/NOUN', 'kaewruamvongs/NOUN', 'say/VERB', 'leo/ADJ', 'satellite/NOUN', 'operate/VERB', 'between/ADP', '500/NUM', 'and/CONJ', '2,000/NUM', 'kilometre/NOUN', 'above/ADP', 'the/DET', 'earth/NOUN', "'s/PRT", 'surface/NOUN', ',/.', 'compare/VERB', 'to/PRT', '36,000km/NUM', 'for/ADP', 'geostationary/ADJ', 'satellite/NOUN', ',/.', 'a/DET', 'traditional/ADJ', 'type/NOUN', 'of/ADP', 
'communication/NOUN', 'satellite/NOUN', './.', 'he/PRON', 'say/VERB', 'the/DET', 'advantage/NOUN', 'of/ADP', 'leo/NOUN', 'satellite/NOUN', 'be/VERB', 'a/DET', 'low/ADJ', 'latency/NOUN', 'signal/NOUN', ',/.', 'which/DET', 'could/VERB', 'benefit/VERB', 'people/NOUN', 'with/ADP', 'access/NOUN', 'to/PRT', 'high-speed/ADJ', 'internet/NOUN', 'service/NOUN', 'via/ADP', '5g/NUM', 'tech/NOUN', 'as/ADV', 'well/ADV', 'a/ADP', 'the/DET', 'usage/NOUN', 'of/ADP', 'internet/NOUN', 'of/ADP', 'thing/NOUN', 'device/NOUN', ',/.', 'machine-to-machine/ADJ', 'tech/NOUN', ',/.', 'drone/NOUN', 'and/CONJ', 'application/NOUN', 'that/ADP', 'require/VERB', 'high/ADJ', 'level/NOUN', 'of/ADP', 'accuracy/NOUN', ',/.', 'such/ADJ', 'a/ADP', 'remote/ADJ', 'surgery/NOUN', './.', 'leo/NOUN', 'satellite/NOUN', 'project/NOUN', 'be/VERB', 'operate/VERB', 'by/ADP', 'two/NUM', 'main/ADJ', 'company/NOUN', ':/.', 'space/NOUN', 'exploration/NOUN', 'technology/NOUN', 'corp/VERB', '(/.', 'spacex/NOUN', ')/.', 'and/CONJ', 'london-based/ADJ', 'satellite/NOUN', 'internet/NOUN', 'access/NOUN', 'provider/NOUN', 'oneweb/NOUN', './.', 'leo/NOUN', 'satellite/NOUN', 'have/VERB', 'be/VERB', 'launch/VERB', 'and/CONJ', 'commercial/ADJ', 'service/NOUN', 'be/VERB', 'expect/VERB', 'from/ADP', 'next/ADJ', 'year/NOUN', './.', 'mr/NOUN', 'anant/NOUN', 'say/VERB', 'related/ADJ', 'leo/NOUN', 'business/NOUN', 'be/VERB', 'go/VERB', 'to/PRT', 'be/VERB', 'the/DET', 'focus/NOUN', 'of/ADP', 'the/DET', 'jv/NOUN', './.', '``/.', 'thaicom/NOUN', "'s/PRT", 'strength/NOUN', 'lie/VERB', 'in/ADP', 'global/ADJ', 'communication/NOUN', 'and/CONJ', 'network/NOUN', 'system/NOUN', 'management/NOUN', 'as/ADV', 'well/ADV', 'a/ADP', 'marketing/NOUN', 'and/CONJ', 'sale/NOUN', 'channel/NOUN', ',/.', 'while/ADP', 'cat/NOUN', 'telecom/NOUN', "'s/PRT", 'strength/NOUN', 'be/VERB', 'gateway/ADJ', 'service/NOUN', ',/.', 'submarine/ADJ', 'fibre/NOUN', 'and/CONJ', 'telecom/NOUN', 'network/NOUN', ',/.', "''/.", 'he/PRON', 'say/VERB', './.', 'the/DET', 
'partnership/NOUN', 'should/VERB', 'enhance/VERB', 'the/DET', 'business/NOUN', 'strategy/NOUN', 'of/ADP', 'the/DET', 'two/NUM', 'company/NOUN', ',/.', 'include/VERB', 'provide/VERB', 'engineering/NOUN', 'network/NOUN', 'management/NOUN', ',/.', 'gateway/NOUN', 'service/NOUN', 'and/CONJ', 'a/DET', 'marketing/NOUN', 'arm/NOUN', ',/.', 'say/VERB', 'mr/NOUN', 'anant/NOUN', './.', '``/.', 'through/ADP', 'the/DET', 'jv/NOUN', ',/.', 'we/PRON', 'can/VERB', 'form/VERB', 'the/DET', 'strong/ADJ', 'partnership/NOUN', 'in/ADP', 'asean/NOUN', 'for/ADP', 'leo/ADJ', 'commercial/ADJ', 'service/NOUN', ',/.', 'especially/ADV', 'in/ADP', 'the/DET', 'clmv/NOUN', 'market/NOUN', ',/.', "''/.", 'he/PRON', 'say/VERB', './.', 'cat/NOUN', 'be/VERB', 'recently/ADV', 'approve/VERB', 'by/ADP', 'the/DET', 'national/ADJ', 'digital/ADJ', 'economy/NOUN', 'and/CONJ', 'society/NOUN', 'committee/NOUN', 'a/ADP', 'the/DET', 'sole/ADJ', 'agency/NOUN', 'handle/VERB', 'the/DET', 'operation/NOUN', 'and/CONJ', 'asset/NOUN', 'of/ADP', 'thaicom/NOUN', "'s/PRT", 'satellite/ADJ', 'service/NOUN', 'concession/NOUN', ',/.', 'which/DET', 'be/VERB', 'due/ADJ', 'to/PRT', 'end/VERB', 'in/ADP', 'september/NOUN', 'next/ADJ', 'year/NOUN', './.', 'cat/NOUN', 'express/VERB', 'it/PRON', 'readiness/NOUN', 'to/PRT', 'take/VERB', 'control/NOUN', 'of/ADP', 'satellite/NOUN', 'thaicom/VERB', '4/NUM', 'and/CONJ', '6/NUM', 'after/ADP', 'the/DET', 'concession/NOUN', 'expire/VERB', './.', 'the/DET', 'state/NOUN', 'telecom/NOUN', 'enterprise/NOUN', 'indicate/VERB', 'it/PRON', 'will/VERB', 'assign/VERB', '24/NUM', 'staff/NOUN', 'to/PRT', 'attend/VERB', 'a/DET', 'training/NOUN', 'course/NOUN', 'in/ADP', 'collaboration/NOUN', 'with/ADP', 'thaicom/NOUN', 'for/ADP', 'one/NUM', 'year/NOUN', ',/.', 'start/VERB', 'this/DET', 'september/NOUN', ',/.', 'to/PRT', 'ensure/VERB', 'seamless/ADJ', 'continuity/NOUN', 'of/ADP', 'the/DET', 'two/NUM', 'satellite/NOUN', "'/PRT", 'operation/NOUN', './.', 'bangkok/NOUN', 'post/NOUN']]
[['(/.', 'afp/NOUN', ')/.', '-/.', 'mcdonald/NOUN', "'s/PRT", 'announce/VERB', 'sunday/NOUN', 'that/ADP', 'it/PRON', 'president/NOUN', 'and/CONJ', 'ceo/NOUN', 'steve/VERB', 'easterbrook/NOUN', 'be/VERB', 'force/VERB', 'out/PRT', 'after/ADP', 'show/VERB', '``/.', 'poor/ADJ', 'judgment/NOUN', "''/.", 'by/ADP', 'engage/VERB', 'in/ADP', 'a/DET', '``/.', 'consensual/ADJ', 'relationship/NOUN', "''/.", 'with/ADP', 'an/DET', 'employee/NOUN', './.', 'he/PRON', 'be/VERB', 'replace/VERB', 'by/ADP', 'chris/NOUN', 'kempczinski/NOUN', ',/.', 'the/DET', 'president/NOUN', 'of/ADP', 'mcdonald/NOUN', "'s/PRT", 'usa/ADJ', './.', 'mr/NOUN', 'kempczinski/NOUN', 'be/VERB', 'also/ADV', 'elect/VERB', 'to/PRT', 'the/DET', 'board/NOUN', 'of/ADP', 'director/NOUN', './.', '``/.', 'easterbrook/NOUN', '.../.', 'have/VERB', 'separate/VERB', 'from/ADP', 'the/DET', 'company/NOUN', 'follow/VERB', 'the/DET', 'board/NOUN', "'s/PRT", 'determination/NOUN', 'that/ADP', 'he/PRON', 'violate/VERB', 'company/NOUN', 'policy/NOUN', 'and/CONJ', 'demonstrate/VERB', 'poor/ADJ', 'judgment/NOUN', 'involve/VERB', 'a/DET', 'recent/ADJ', 'consensual/ADJ', 'relationship/NOUN', 'with/ADP', 'an/DET', 'employee/NOUN', ',/.', "''/.", 'the/DET', 'company/NOUN', 'say/VERB', 'in/ADP', 'a/DET', 'statement/NOUN', './.', '``/.', 'the/DET', 'company/NOUN', 'confirm/VERB', 'that/ADP', 'this/DET', 'leadership/NOUN', 'transition/NOUN', 'be/VERB', 'unrelated/ADJ', 'to/PRT', 'the/DET', 'company/NOUN', "'s/PRT", 'operational/ADJ', 'or/CONJ', 'financial/ADJ', 'performance/NOUN', './.', "''/.", 'in/ADP', 'an/DET', 'email/NOUN', 'to/PRT', 'mcdonald/VERB', "'s/PRT", 'employee/NOUN', ',/.', 'easterbrook/NOUN', 'say/VERB', 'his/PRON', 'relationship/NOUN', 'be/VERB', '``/.', 'a/DET', 'mistake/NOUN', "''/.", 'that/DET', 'violate/VERB', 'company/NOUN', 'policy/NOUN', './.', '``/.', 'give/VERB', 'the/DET', 'value/NOUN', 'of/ADP', 'the/DET', 'company/NOUN', ',/.', 'i/ADJ', 'agree/VERB', 'with/ADP', 'the/DET', 'board/NOUN', 'that/ADP', 'it/PRON', 
'be/VERB', 'time/NOUN', 'for/ADP', 'me/PRON', 'to/PRT', 'move/VERB', 'on/ADP', ',/.', "''/.", 'the/DET', 'email/NOUN', 'say/VERB', './.', 'joe/NOUN', 'erlinger/NOUN', ',/.', 'president/NOUN', 'of/ADP', 'international/ADJ', 'operated/ADJ', 'market/NOUN', ',/.', 'will/VERB', 'take/VERB', 'over/PRT', 'a/ADP', 'head/NOUN', 'of/ADP', 'mcdonald/NOUN', "'s/PRT", 'usa/ADJ', ',/.', 'the/DET', 'company/NOUN', 'say/VERB', './.', 'in/ADP', 'it/PRON', 'most/ADV', 'recent/ADJ', 'earnings/NOUN', 'report/NOUN', ',/.', 'on/ADP', 'october/PRON', '22/NUM', ',/.', 'mcdonald/NOUN', "'s/PRT", 'say/VERB', 'profit/NOUN', 'dip/VERB', '1.8/NUM', 'percent/NOUN', 'in/ADP', 'the/DET', 'third/ADJ', 'quarter/NOUN', 'from/ADP', 'the/DET', 'year-ago/ADJ', 'period/NOUN', 'to/PRT', '$/.', '1.6/NUM', 'billion/NUM', './.', 'revenue/NOUN', 'at/ADP', 'the/DET', 'company/NOUN', ',/.', 'which/DET', 'have/VERB', '38,000/NUM', 'restaurant/NOUN', 'in/ADP', 'more/ADJ', 'than/ADP', '100/NUM', 'country/NOUN', ',/.', 'edge/VERB', 'up/ADV', '1.1/NUM', 'percent/NOUN', 'to/PRT', '$/.', '5.4/NUM', 'billion/NUM', './.', 'the/DET', 'fast-food/NOUN', 'giant/NOUN', 'notch/VERB', 'a/DET', 'healthy/ADJ', '5.9/NUM', 'percent/ADJ', 'increase/NOUN', 'in/ADP', 'global/ADJ', 'comparable/ADJ', 'sale/NOUN', ',/.', 'include/VERB', 'a/DET', 'solid/ADJ', 'rise/NOUN', 'in/ADP', 'the/DET', 'united/ADJ', 'state/NOUN', './.', 'but/CONJ', 'profit/NOUN', 'be/VERB', 'pressure/VERB', 'by/ADP', 'increased/ADJ', 'spending/NOUN', 'on/ADP', 'technology/NOUN', 'and/CONJ', 'research/NOUN', 'and/CONJ', 'development/NOUN', './.', 'mcdonald/NOUN', "'s/PRT", 'have/VERB', 'invest/VERB', 'heavily/ADV', 'in/ADP', 'home/NOUN', 'delivery/NOUN', 'and/CONJ', 'mobile/ADJ', 'pay/NOUN', 'initiative/NOUN', 'in/ADP', 'recent/ADJ', 'year/NOUN', ',/.', 'and/CONJ', 'in/ADP', '2019/NUM', 'have/VERB', 'unveil/VERB', 'a/DET', 'number/NOUN', 'of/ADP', 'acquisition/NOUN', 'to/PRT', 'boost/VERB', 'it/PRON', 'drive-through/ADJ', 'operation/NOUN', './.', 'mr/NOUN', 
'kempczinski/NOUN', 'tell/VERB', 'the/DET', 'wall/ADJ', 'street/NOUN', 'journal/NOUN', 'on/ADP', 'sunday/NOUN', 'that/ADP', 'he/PRON', 'plan/VERB', 'to/PRT', 'continue/VERB', 'the/DET', 'focus/NOUN', 'on/ADP', 'technology/NOUN', './.', '``/.', 'there/DET', 'be/VERB', "n't/ADV", 'go/VERB', 'to/PRT', 'be/VERB', 'some/DET', 'radical/ADJ', ',/.', 'strategic/ADJ', 'shift/NOUN', './.', 'the/DET', 'plan/NOUN', 'be/VERB', 'work/VERB', ',/.', "''/.", 'he/PRON', 'say/VERB', './.', 'mr/NOUN', 'easterbrook/NOUN', 'have/VERB', 'serve/VERB', 'a/ADP', 'chief/NOUN', 'executive/NOUN', 'since/ADP', '2015./NUM', 'under/ADP', 'his/PRON', 'leadership/NOUN', ',/.', 'mcdonald/NOUN', "'s/PRT", 'share/NOUN', 'price/NOUN', 'double/VERB', ',/.', 'but/CONJ', 'he/PRON', 'be/VERB', 'unable/ADJ', 'to/PRT', 'stop/VERB', 'a/DET', 'decline/NOUN', 'in/ADP', 'sale/NOUN', './.', 'like/ADP', 'other/ADJ', 'fast/ADJ', 'food/NOUN', 'chain/NOUN', ',/.', 'mcdonald/NOUN', "'s/PRT", 'be/VERB', 'face/VERB', 'headwind/NOUN', 'a/ADP', 'consumer/NOUN', 'seek/VERB', 'out/PRT', 'healthy/ADJ', 'din/VERB', 'option/NOUN', './.', 'mr/NOUN', 'easterbrook/NOUN', "'s/PRT", 'pay/NOUN', 'a/ADP', 'ceo/NOUN', 'rise/VERB', 'with/ADP', 'mcdonald/NOUN', "'s/PRT", 'share/NOUN', 'price/NOUN', ',/.', 'which/DET', 'close/VERB', 'last/ADJ', 'week/NOUN', 'at/ADP', '$/.', '194/NUM', './.', 'his/PRON', 'compensation/NOUN', 'hit/VERB', 'a/DET', 'peak/NOUN', 'in/ADP', '2017/NUM', 'at/ADP', '$/.', '21.8/NUM', 'million/NUM', ',/.', 'include/VERB', '$/.', '9.1/NUM', 'million/NUM', 'in/ADP', 'incentive-based/ADJ', 'pay/NOUN', ',/.', 'the/DET', 'journal/NOUN', 'say/VERB', './.', 'workplace/NOUN', 'relationship/NOUN', 'have/VERB', 'cost/VERB', 'a/DET', 'number/NOUN', 'of/ADP', 'ceo/NOUN', 'their/PRON', 'job/NOUN', 'in/ADP', 'recent/ADJ', 'year/NOUN', ',/.', 'and/CONJ', 'the/DET', 'topic/NOUN', 'have/VERB', 'become/VERB', 'even/ADV', 'more/ADV', 'sensitive/ADJ', 'amid/ADP', 'the/DET', '#/.', 'metoo/ADJ', 'movement/NOUN', './.', 'intel/NOUN', 
'ceo/NOUN', 'brian/NOUN', 'krzanich/NOUN', 'and/CONJ', 'yoga/NOUN', 'apparel/NOUN', 'brand/NOUN', 'lululemon/NOUN', 'chief/ADJ', 'laurent/ADJ', 'potdevin/NOUN', 'resign/VERB', 'from/ADP', 'their/PRON', 'company/NOUN', 'in/ADP', '2018/NUM', 'follow/VERB', 'revelation/NOUN', 'of/ADP', 'relationship/NOUN', 'with/ADP', 'employee/NOUN', './.', 'in/ADP', '2016/NUM', ',/.', 'priceline/VERB', 'ceo/ADJ', 'darren/NOUN', 'huston/NOUN', 'step/VERB', 'down/ADV', 'for/ADP', 'the/DET', 'same/ADJ', 'reason/NOUN', ',/.', 'a/ADP', 'do/VERB', 'bestbuy/VERB', 'ceo/NOUN', 'brian/ADJ', 'dunn/NOUN', 'in/ADP', '2012/NUM', './.']]
[2023-08-15 16:02:24,409][hyfi.pipeline.config][INFO] - Returning partial function: lexikanon.pipes.tokenize.extract_tokens with kwargs: {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'tokens', 'nouns_only': False, 'postags': None, 'stop_postags': None, 'strip_pos': True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}
[2023-08-15 16:02:24,410][hyfi.composer.composer][INFO] - instantiating lexikanon.pipes.tokenize.extract_tokens ...
Map (num_proc=100): 100%|███████████| 1000/1000 [00:03<00:00, 313.19 examples/s]
[2023-08-15 16:02:28,396][lexikanon.pipes.tokenize][INFO] - Extracting tokens done, see column 'tokens'.
[['set-listed', 'thaicom', 'say', 'joint', 'venture', 'recently', 'form', 'cat', 'telecom', 'would', 'serve', 'hub', 'low', 'earth', 'orbit', 'leo', 'satellite', 'service', 'cambodia', 'lao', 'myanmar', 'vietnam', 'clmv', 'market', 'cater', 'demand', 'high-speed', 'internet', 'tech', 'innovative', 'application', 'establishment', 'joint', 'venture', 'call', 'nation', 'space', 'technology', 'report', 'thaicom', 'stock', 'exchange', 'thailand', 'late', 'last', 'month', 'registered', 'capital', 'baht', 'thaicom', 'hold', 'stake', 'cat', 'hold', 'aim', 'provide', 'satellite', 'gateway', 'service', 'solution', 'well', 'market', 'sale', 'leo', 'satellite', 'thaicom', 'chief', 'executive', 'anant', 'kaewruamvongs', 'say', 'leo', 'satellite', 'operate', 'kilometre', 'earth', 'surface', 'compare', 'geostationary', 'satellite', 'traditional', 'type', 'communication', 'satellite', 'say', 'advantage', 'leo', 'satellite', 'low', 'latency', 'signal', 'could', 'benefit', 'people', 'access', 'high-speed', 'internet', 'service', 'tech', 'well', 'usage', 'internet', 'thing', 'device', 'machine-to-machine', 'tech', 'drone', 'application', 'require', 'high', 'level', 'accuracy', 'remote', 'surgery', 'leo', 'satellite', 'project', 'operate', 'main', 'company', 'space', 'exploration', 'technology', 'corp', 'spacex', 'london-based', 'satellite', 'internet', 'access', 'provider', 'oneweb', 'leo', 'satellite', 'launch', 'commercial', 'service', 'expect', 'next', 'year', 'anant', 'say', 'related', 'leo', 'business', 'focus', 'thaicom', 'strength', 'lie', 'global', 'communication', 'network', 'system', 'management', 'well', 'marketing', 'sale', 'channel', 'cat', 'telecom', 'strength', 'gateway', 'service', 'submarine', 'fibre', 'telecom', 'network', 'say', 'partnership', 'enhance', 'business', 'strategy', 'company', 'include', 'provide', 'engineering', 'network', 'management', 'gateway', 'service', 'marketing', 'arm', 'say', 'anant', 'form', 'strong', 'partnership', 'asean', 'leo', 
'commercial', 'service', 'especially', 'clmv', 'market', 'say', 'cat', 'recently', 'approve', 'national', 'digital', 'economy', 'society', 'committee', 'sole', 'agency', 'handle', 'operation', 'asset', 'thaicom', 'satellite', 'service', 'concession', 'due', 'end', 'september', 'next', 'year', 'cat', 'express', 'readiness', 'take', 'control', 'satellite', 'thaicom', 'concession', 'expire', 'state', 'telecom', 'enterprise', 'indicate', 'assign', 'staff', 'attend', 'training', 'course', 'collaboration', 'thaicom', 'year', 'start', 'september', 'ensure', 'seamless', 'continuity', 'satellite', 'operation', 'bangkok', 'post']]
[['afp', 'mcdonald', 'announce', 'sunday', 'president', 'ceo', 'steve', 'easterbrook', 'force', 'show', 'poor', 'judgment', 'engage', 'consensual', 'relationship', 'employee', 'replace', 'chris', 'kempczinski', 'president', 'mcdonald', 'usa', 'kempczinski', 'also', 'elect', 'board', 'director', 'easterbrook', 'separate', 'company', 'follow', 'board', 'determination', 'violate', 'company', 'policy', 'demonstrate', 'poor', 'judgment', 'involve', 'recent', 'consensual', 'relationship', 'employee', 'company', 'say', 'statement', 'company', 'confirm', 'leadership', 'transition', 'unrelated', 'company', 'operational', 'financial', 'performance', 'email', 'mcdonald', 'employee', 'easterbrook', 'say', 'relationship', 'mistake', 'violate', 'company', 'policy', 'give', 'value', 'company', 'agree', 'board', 'time', 'move', 'email', 'say', 'joe', 'erlinger', 'president', 'international', 'operated', 'market', 'take', 'head', 'mcdonald', 'usa', 'company', 'say', 'recent', 'earnings', 'report', 'mcdonald', 'say', 'profit', 'dip', 'percent', 'third', 'quarter', 'year-ago', 'period', 'revenue', 'company', 'restaurant', 'country', 'edge', 'percent', 'fast-food', 'giant', 'notch', 'healthy', 'percent', 'increase', 'global', 'comparable', 'sale', 'include', 'solid', 'rise', 'united', 'state', 'profit', 'pressure', 'increased', 'spending', 'technology', 'research', 'development', 'mcdonald', 'invest', 'heavily', 'home', 'delivery', 'mobile', 'pay', 'initiative', 'recent', 'year', 'unveil', 'number', 'acquisition', 'boost', 'drive-through', 'operation', 'kempczinski', 'tell', 'wall', 'street', 'journal', 'sunday', 'plan', 'continue', 'focus', 'technology', "n't", 'radical', 'strategic', 'shift', 'plan', 'work', 'say', 'easterbrook', 'serve', 'chief', 'executive', 'leadership', 'mcdonald', 'share', 'price', 'double', 'unable', 'stop', 'decline', 'sale', 'fast', 'food', 'chain', 'mcdonald', 'face', 'headwind', 'consumer', 'seek', 'healthy', 'din', 'option', 'easterbrook', 'pay', 'ceo', 
'rise', 'mcdonald', 'share', 'price', 'close', 'last', 'week', 'compensation', 'hit', 'peak', 'include', 'incentive-based', 'pay', 'journal', 'say', 'workplace', 'relationship', 'cost', 'number', 'ceo', 'job', 'recent', 'year', 'topic', 'become', 'even', 'sensitive', 'metoo', 'movement', 'intel', 'ceo', 'brian', 'krzanich', 'yoga', 'apparel', 'brand', 'lululemon', 'chief', 'laurent', 'potdevin', 'resign', 'company', 'follow', 'revelation', 'relationship', 'employee', 'priceline', 'ceo', 'darren', 'huston', 'step', 'reason', 'bestbuy', 'ceo', 'brian', 'dunn']]
[2023-08-15 16:02:28,404][hyfi.pipeline.config][INFO] - Returning partial function: lexikanon.pipes.tokenize.extract_tokens with kwargs: {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'adjnouns', 'nouns_only': False, 'postags': ['ADJ', 'NOUN'], 'stop_postags': None, 'strip_pos': True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}
[2023-08-15 16:02:28,406][hyfi.composer.composer][INFO] - instantiating lexikanon.pipes.tokenize.extract_tokens ...
Map (num_proc=100): 100%|███████████| 1000/1000 [00:03<00:00, 325.62 examples/s]
[2023-08-15 16:02:32,305][lexikanon.pipes.tokenize][INFO] - Extracting tokens done, see column 'adjnouns'.
[['set-listed', 'thaicom', 'joint', 'venture', 'cat', 'telecom', 'hub', 'low', 'earth', 'orbit', 'leo', 'satellite', 'service', 'cambodia', 'lao', 'myanmar', 'vietnam', 'clmv', 'market', 'high-speed', 'internet', 'tech', 'innovative', 'application', 'establishment', 'joint', 'venture', 'nation', 'space', 'technology', 'thaicom', 'stock', 'exchange', 'thailand', 'late', 'last', 'month', 'registered', 'capital', 'baht', 'thaicom', 'stake', 'cat', 'satellite', 'gateway', 'service', 'solution', 'sale', 'leo', 'satellite', 'thaicom', 'chief', 'executive', 'anant', 'kaewruamvongs', 'leo', 'satellite', 'kilometre', 'earth', 'surface', 'geostationary', 'satellite', 'traditional', 'type', 'communication', 'satellite', 'advantage', 'leo', 'satellite', 'low', 'latency', 'signal', 'people', 'access', 'high-speed', 'internet', 'service', 'tech', 'usage', 'internet', 'thing', 'device', 'machine-to-machine', 'tech', 'drone', 'application', 'high', 'level', 'accuracy', 'remote', 'surgery', 'leo', 'satellite', 'project', 'main', 'company', 'space', 'exploration', 'technology', 'spacex', 'london-based', 'satellite', 'internet', 'access', 'provider', 'oneweb', 'leo', 'satellite', 'commercial', 'service', 'next', 'year', 'anant', 'related', 'leo', 'business', 'focus', 'thaicom', 'strength', 'global', 'communication', 'network', 'system', 'management', 'marketing', 'sale', 'channel', 'cat', 'telecom', 'strength', 'gateway', 'service', 'submarine', 'fibre', 'telecom', 'network', 'partnership', 'business', 'strategy', 'company', 'engineering', 'network', 'management', 'gateway', 'service', 'marketing', 'arm', 'anant', 'strong', 'partnership', 'asean', 'leo', 'commercial', 'service', 'clmv', 'market', 'cat', 'national', 'digital', 'economy', 'society', 'committee', 'sole', 'agency', 'operation', 'asset', 'thaicom', 'satellite', 'service', 'concession', 'due', 'september', 'next', 'year', 'cat', 'readiness', 'control', 'satellite', 'concession', 'state', 'telecom', 'enterprise', 'staff', 
'training', 'course', 'collaboration', 'thaicom', 'year', 'september', 'seamless', 'continuity', 'satellite', 'operation', 'bangkok', 'post']]
[['afp', 'mcdonald', 'sunday', 'president', 'ceo', 'easterbrook', 'poor', 'judgment', 'consensual', 'relationship', 'employee', 'chris', 'kempczinski', 'president', 'mcdonald', 'usa', 'kempczinski', 'board', 'director', 'easterbrook', 'company', 'board', 'determination', 'company', 'policy', 'poor', 'judgment', 'recent', 'consensual', 'relationship', 'employee', 'company', 'statement', 'company', 'leadership', 'transition', 'unrelated', 'company', 'operational', 'financial', 'performance', 'email', 'employee', 'easterbrook', 'relationship', 'mistake', 'company', 'policy', 'value', 'company', 'board', 'time', 'email', 'joe', 'erlinger', 'president', 'international', 'operated', 'market', 'head', 'mcdonald', 'usa', 'company', 'recent', 'earnings', 'report', 'mcdonald', 'profit', 'percent', 'third', 'quarter', 'year-ago', 'period', 'revenue', 'company', 'restaurant', 'country', 'percent', 'fast-food', 'giant', 'healthy', 'percent', 'increase', 'global', 'comparable', 'sale', 'solid', 'rise', 'united', 'state', 'profit', 'increased', 'spending', 'technology', 'research', 'development', 'mcdonald', 'home', 'delivery', 'mobile', 'pay', 'initiative', 'recent', 'year', 'number', 'acquisition', 'drive-through', 'operation', 'kempczinski', 'wall', 'street', 'journal', 'sunday', 'focus', 'technology', 'radical', 'strategic', 'shift', 'plan', 'easterbrook', 'chief', 'executive', 'leadership', 'mcdonald', 'share', 'price', 'unable', 'decline', 'sale', 'fast', 'food', 'chain', 'mcdonald', 'headwind', 'consumer', 'healthy', 'option', 'easterbrook', 'pay', 'ceo', 'mcdonald', 'share', 'price', 'last', 'week', 'compensation', 'peak', 'incentive-based', 'pay', 'journal', 'workplace', 'relationship', 'number', 'ceo', 'job', 'recent', 'year', 'topic', 'sensitive', 'metoo', 'movement', 'intel', 'ceo', 'brian', 'krzanich', 'yoga', 'apparel', 'brand', 'lululemon', 'chief', 'laurent', 'potdevin', 'company', 'revelation', 'relationship', 'employee', 'ceo', 'darren', 'huston', 'reason', 
'ceo', 'brian', 'dunn']]
[2023-08-15 16:02:32,320][hyfi.pipeline.config][INFO] - Returning partial function: lexikanon.pipes.tokenize.extract_tokens with kwargs: {'_target_': 'lexikanon.pipes.tokenize.extract_tokens', 'tokenizer': 'nbcpu', 'num_workers': 100, 'batched': True, 'batch_size': 1000, 'token_col': 'tokenized', 'extracted_col': 'predicates', 'nouns_only': False, 'postags': ['VERB', 'PRT', 'ADJ', 'ADV'], 'stop_postags': None, 'strip_pos': True, 'postag_delim': None, 'postag_length': None, 'remove_columns': None, 'load_from_cache_file': False, 'num_heads': 1, 'num_tails': 1, 'verbose': True}
[2023-08-15 16:02:32,323][hyfi.composer.composer][INFO] - instantiating lexikanon.pipes.tokenize.extract_tokens ...
Map (num_proc=100): 100%|███████████| 1000/1000 [00:03<00:00, 305.79 examples/s]
[2023-08-15 16:02:36,435][lexikanon.pipes.tokenize][INFO] - Extracting tokens done, see column 'predicates'.
[['set-listed', 'say', 'joint', 'recently', 'form', 'would', 'serve', 'low', 'cater', 'demand', 'high-speed', 'innovative', 'joint', 'call', 'report', 'late', 'last', 'registered', 'hold', 'hold', 'aim', 'provide', 'satellite', 'well', 'market', 'leo', 'chief', 'say', 'leo', 'operate', 'compare', 'geostationary', 'traditional', 'say', 'low', 'could', 'benefit', 'high-speed', 'well', 'machine-to-machine', 'require', 'high', 'remote', 'operate', 'main', 'corp', 'london-based', 'launch', 'commercial', 'expect', 'next', 'say', 'related', 'lie', 'global', 'well', 'gateway', 'submarine', 'say', 'enhance', 'include', 'provide', 'say', 'form', 'strong', 'leo', 'commercial', 'especially', 'say', 'recently', 'approve', 'national', 'digital', 'sole', 'handle', 'satellite', 'due', 'end', 'next', 'express', 'take', 'thaicom', 'expire', 'indicate', 'assign', 'attend', 'start', 'ensure', 'seamless']]
[['announce', 'steve', 'force', 'show', 'poor', 'engage', 'consensual', 'replace', 'usa', 'also', 'elect', 'separate', 'follow', 'violate', 'demonstrate', 'poor', 'involve', 'recent', 'consensual', 'say', 'confirm', 'unrelated', 'operational', 'financial', 'mcdonald', 'say', 'violate', 'give', 'agree', 'move', 'say', 'international', 'operated', 'take', 'usa', 'say', 'recent', 'say', 'dip', 'third', 'year-ago', 'edge', 'notch', 'healthy', 'percent', 'global', 'comparable', 'include', 'solid', 'united', 'pressure', 'increased', 'invest', 'heavily', 'mobile', 'recent', 'unveil', 'boost', 'drive-through', 'tell', 'wall', 'plan', 'continue', "n't", 'radical', 'strategic', 'work', 'say', 'serve', 'double', 'unable', 'stop', 'fast', 'face', 'seek', 'healthy', 'din', 'rise', 'close', 'last', 'hit', 'include', 'incentive-based', 'say', 'cost', 'recent', 'become', 'even', 'sensitive', 'metoo', 'chief', 'laurent', 'resign', 'follow', 'priceline', 'ceo', 'step', 'bestbuy', 'brian']]
[2023-08-15 16:02:36,441][hyfi.pipeline.config][INFO] - Running a pipe with hyfi.pipe.general_instance_methods
[2023-08-15 16:02:36,504][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns with kwargs: {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns', 'expressions': {'id': "url.str.split('/').str[3]"}, 'engine': 'python', 'verbose': True}
[2023-08-15 16:02:36,504][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.basic.DSBasic.dataframe_eval_columns ...
[2023-08-15 16:02:36,506][hyfi.utils.datasets.basic][INFO] - Evaluating column id
[2023-08-15 16:02:36,584][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data with kwargs: {'_target_': 'hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data', 'queries': ['tokens.str.len() > 50'], 'sample_size': None, 'sample_seed': 42, 'output_dir': 'datasets/processed/khmer_tokenized_sample', 'sample_filename': None, 'train_filename': 'train.parquet', 'discard_filename': 'discard.parquet', 'returning_data': 'train', 'verbose': True}
[2023-08-15 16:02:36,585][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.slice.DSSlice.filter_and_sample_data ...
[2023-08-15 16:02:36,588][hyfi.utils.datasets.slice][INFO] - filtering data by tokens.str.len() > 50
[2023-08-15 16:02:36,590][hyfi.utils.datasets.slice][INFO] - filtered 8 documents
[2023-08-15 16:02:36,591][hyfi.utils.datasets.save][INFO] - Saving dataframe to /raid/cis/yjlee/workspace/projects/nbcpu/workspace/datasets/processed/khmer_tokenized_sample/train.parquet
[2023-08-15 16:02:37,150][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.558764
                                                 url  ...         id
0  https://www.khmertimeskh.com/743741/thaicom-ca...  ...     743741
1  https://www.khmertimeskh.com/501295606/cambodi...  ...  501295606
2  https://www.khmertimeskh.com/12732/toshiba-to-...  ...      12732
3  https://www.khmertimeskh.com/50618111/can-the-...  ...   50618111
4  https://www.khmertimeskh.com/50747762/go-green...  ...   50747762

[5 rows x 11 columns]
[2023-08-15 16:02:37,166][hyfi.utils.datasets.save][INFO] - Saving dataframe to /raid/cis/yjlee/workspace/projects/nbcpu/workspace/datasets/processed/khmer_tokenized_sample/discard.parquet
[2023-08-15 16:02:37,169][hyfi.utils.datasets.save][INFO] -  >> elapsed time to save data: 0:00:00.003255
                                                   url  ...         id
53   https://www.khmertimeskh.com/50981157/37-roads...  ...   50981157
61   https://www.khmertimeskh.com/50885944/covid-wa...  ...   50885944
269  https://www.khmertimeskh.com/501258753/evolvin...  ...  501258753
282  https://www.khmertimeskh.com/510155/china-prom...  ...     510155
376  https://www.khmertimeskh.com/501312739/video-g...  ...  501312739

[5 rows x 11 columns]
[2023-08-15 16:02:37,180][hyfi.utils.datasets.slice][INFO] - Created 0 samples, 992 train samples, and 8 discard samples
[2023-08-15 16:02:37,181][hyfi.pipeline.config][INFO] - Running a pipe with hyfi.pipe.general_external_funcs
[2023-08-15 16:02:37,183][hyfi.pipeline.config][INFO] - Returning partial function: hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail with kwargs: {'_target_': 'hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail', 'num_heads': 5, 'num_tails': 5, 'columns': ['id', 'text', 'tokens', 'adjnouns', 'predicates'], 'verbose': True}
[2023-08-15 16:02:37,184][hyfi.composer.composer][INFO] - instantiating hyfi.utils.datasets.basic.DSBasic.dataframe_print_head_and_tail ...
[2023-08-15 16:02:37,186][hyfi.utils.datasets.basic][INFO] - Printing head and tail of dataframe
[2023-08-15 16:02:37,186][hyfi.utils.datasets.basic][INFO] - Head:
          id  ...                                         predicates
0     743741  ...  [set-listed, say, joint, recently, form, would...
1  501295606  ...  [aim, attract, chinese, take, make, easy, trav...
2      12732  ...  [cambodian, award, japanese-owned, construct, ...
3   50618111  ...  [fret, currently, roil, bad, front, centre, in...
4   50747762  ...  [especially, southern, eastern, central, batte...

[5 rows x 5 columns]
[2023-08-15 16:02:37,196][hyfi.utils.datasets.basic][INFO] - Tail:
            id  ...                                         predicates
995  501163252  ...  [lose, lose, close, open, record, high, main, ...
996  501319124  ...  [cambodian, set, new, regard, foreign, allow, ...
997  501072218  ...  [recent, show, extreme, halve, higher, rural, ...
998  501282549  ...  [publicly-listed, autonomous, start, new, upgr...
999   50657009  ...  [announce, steve, force, show, poor, engage, c...

[5 rows x 5 columns]
 Change directory back to /raid/cis/yjlee/workspace/logs/hydra/nbcpu/2023-08-15/2023-08-15_16-02-13
[2023-08-15 16:02:37,205][hyfi.task.task][INFO] -  >> elapsed time for the task with 1 pipelines: 0:00:19.859572
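The log above records three operations after tokenization: extracting tokens by part-of-speech tag (`postags: ['ADJ', 'NOUN']` for `adjnouns`, `['VERB', 'PRT', 'ADJ', 'ADV']` for `predicates`), deriving an `id` column with the expression `url.str.split('/').str[3]`, and splitting the data with the query `tokens.str.len() > 50`. The following is a minimal plain-Python sketch of that logic, not the `lexikanon` or `hyfi` implementation. The function names, the `(word, tag)` tuple format for tagged tokens, and the example URL slug are assumptions made for illustration; the pipeline itself evaluates these expressions on a dataframe.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def filter_by_pos(tagged: Iterable[Tuple[str, str]], postags: Sequence[str]) -> List[str]:
    """Keep the words whose tag is in `postags`, dropping the tag itself.

    The (word, tag) tuple format is an assumption for this sketch; the
    internal format of the pipeline's 'tokenized' column is not shown in the log.
    """
    wanted = set(postags)
    return [word for word, tag in tagged if tag in wanted]


def extract_id(url: str) -> str:
    """Plain-Python equivalent of the logged expression url.str.split('/').str[3]."""
    return url.split("/")[3]


def split_by_length(docs: List[Dict], min_tokens: int = 50) -> Tuple[List[Dict], List[Dict]]:
    """Mirror the query tokens.str.len() > 50: matching rows are kept, the rest discarded."""
    train = [d for d in docs if len(d["tokens"]) > min_tokens]
    discard = [d for d in docs if len(d["tokens"]) <= min_tokens]
    return train, discard


# Hypothetical tagged tokens using universal POS tags.
tagged = [("thaicom", "NOUN"), ("say", "VERB"), ("joint", "ADJ"), ("venture", "NOUN")]
adjnouns = filter_by_pos(tagged, ["ADJ", "NOUN"])
predicates = filter_by_pos(tagged, ["VERB", "PRT", "ADJ", "ADV"])

# The slug after the numeric id is made up; only the position of the id matters.
doc_id = extract_id("https://www.khmertimeskh.com/743741/example-slug")

# Two toy documents: one above and one below the 50-token threshold.
docs = [{"id": "a", "tokens": ["w"] * 60}, {"id": "b", "tokens": ["w"] * 10}]
train, discard = split_by_length(docs)
```

In the logged run, the same query kept 992 of the 1,000 documents as `train.parquet` and wrote the 8 documents with 50 or fewer tokens to `discard.parquet`.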