# Crawling Khmer Times

## Overview

The Khmer Times dataset is built in three steps: web scraping, article text extraction, and data serialization. This section explains each step and then walks through the workflow used to crawl news articles from the Khmer Times.

### 1. Web Scraping

#### 1.1 Structure and Approach

Scraping is done with the `requests` and `BeautifulSoup` libraries for Python. The crawler walks the search results pages of the Khmer Times website and extracts the URLs of the news articles.

- **Tags and Classes:** Articles are identified within an `<article>` tag with the class `item item-media`. The title is found in an `<h2>` tag with the class `item-title`, and the URL is within an `<a>` tag inside the `<h2>` tag.
- **Functionality:** A Python function iteratively crawls the search results pages using a search keyword until a 404 error is encountered, indicating no more results.
- **Output:** The title and URL of each article are extracted and stored in a list of dictionaries.
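
The link-collection loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual fetcher: the `SEARCH_URL` pattern and the function names `fetch_page`, `parse_links`, and `crawl_links` are assumptions introduced here, while the tag and class selectors come from the description above.

```python
from typing import Callable, Dict, List, Optional
from urllib.parse import quote_plus

# Assumed URL pattern for paginated search results; not taken from the source.
SEARCH_URL = "https://www.khmertimeskh.com/page/{page}/?s={keyword}"


def fetch_page(url: str) -> Optional[str]:
    """Return the page HTML, or None when the server answers 404."""
    import requests  # third-party; imported here so the sketch loads without it

    response = requests.get(url, timeout=30)
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.text


def parse_links(html: str) -> List[Dict[str, str]]:
    """Extract title and URL from each <article class="item item-media">."""
    from bs4 import BeautifulSoup  # third-party; imported lazily as above

    soup = BeautifulSoup(html, "html.parser")
    links = []
    for article in soup.select("article.item.item-media"):
        anchor = article.select_one("h2.item-title a")
        if anchor is not None and anchor.get("href"):
            links.append(
                {"title": anchor.get_text(strip=True), "url": anchor["href"]}
            )
    return links


def crawl_links(
    keyword: str,
    fetch: Callable[[str], Optional[str]] = fetch_page,
    parse: Callable[[str], List[Dict[str, str]]] = parse_links,
    max_pages: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Walk result pages for a keyword until a 404 or the page limit."""
    links: List[Dict[str, str]] = []
    page = 1
    while max_pages is None or page <= max_pages:
        url = SEARCH_URL.format(page=page, keyword=quote_plus(keyword))
        html = fetch(url)
        if html is None:  # 404: no more result pages
            break
        links.extend(parse(html))
        page += 1
    return links
```

Passing `fetch` and `parse` as arguments keeps the pagination logic separate from network access, so the loop can be exercised with canned HTML before it is pointed at the live site.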

### 2. Article Text Extraction

#### 2.1 Structure and Approach

A separate Python function follows the URLs extracted from the previous step to scrape the text of the articles.

- **Tags and Classes:** The text is found within `<p>` tags inside a `<div>` tag with the class `entry-content`. Categories are within `<a>` tags in a `<div>` with the class `entry-meta`, and the publication time is in a `<time>` tag within the same div.
- **Functionality:** The function iterates over the URLs, sending a GET request to each and parsing the HTML to extract the text, categories, and publication time.
- **Output:** The extracted data is stored in a list of dictionaries.
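
As a rough illustration of the extraction step, the sketch below parses a single article page. The function name `parse_article` is hypothetical, and reading the timestamp from the `<time>` tag's `datetime` attribute is an assumption about the page markup; the selectors follow the description above.

```python
from datetime import datetime
from typing import Any, Dict


def parse_article(html: str) -> Dict[str, Any]:
    """Extract text, categories, and publication time from article HTML."""
    from bs4 import BeautifulSoup  # third-party; imported lazily

    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one("div.entry-content")
    meta = soup.select_one("div.entry-meta")

    # Body text: one line per <p> inside div.entry-content.
    paragraphs = content.find_all("p") if content is not None else []
    text = "\n".join(p.get_text(strip=True) for p in paragraphs)

    # Categories: link texts inside div.entry-meta.
    categories = (
        [a.get_text(strip=True) for a in meta.find_all("a")]
        if meta is not None
        else []
    )

    # Publication time: assumed to be an ISO 8601 string in <time datetime="...">.
    published = None
    time_tag = meta.find("time") if meta is not None else None
    if time_tag is not None and time_tag.get("datetime"):
        published = datetime.fromisoformat(time_tag["datetime"])

    return {"text": text, "categories": categories, "time": published}
```

Missing elements yield empty values rather than exceptions, which lets a crawl continue past pages whose layout differs from the expected one.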

### 3. Data Serialization

The list of dictionaries is serialized to a JSON file using Python's built-in `json` module.

- **Datetime Conversion:** The `datetime` objects for publication times are converted to ISO 8601 formatted strings, as they are not JSON serializable.
- **JSON Structure:** The resulting JSON file contains an array of objects, with fields for `text`, `categories`, and `time`.

This JSON file serves as the dataset for the project.
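
The datetime conversion described above can be sketched with the `default` hook of `json.dumps`. The helper names `encode_datetime`, `articles_to_json`, and `save_articles` are illustrative and not taken from the project code.

```python
import json
from datetime import datetime
from typing import Any, Dict, List


def encode_datetime(value: Any) -> str:
    """Fallback encoder: turn datetime objects into ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(
        f"Object of type {type(value).__name__} is not JSON serializable"
    )


def articles_to_json(articles: List[Dict[str, Any]]) -> str:
    """Serialize article dictionaries to a JSON array string."""
    return json.dumps(
        articles, default=encode_datetime, ensure_ascii=False, indent=2
    )


def save_articles(articles: List[Dict[str, Any]], path: str) -> None:
    """Write the JSON array to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(articles_to_json(articles))
```

`json` calls the `default` function only for values it cannot encode itself, so strings, lists, and numbers pass through unchanged and only the `time` field is converted.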


## Crawling Workflow

### Configuration and Commands

The crawling configuration is located in the `src/nbcpu/conf/fetcher` directory. To print the configuration, use:


In [16]:
!nbcpu +fetcher=khmer_all dryrun=true

## Command Line Interface for HyFI ##
{'about': {'authors': 'Young Joon Lee <entelecheia@hotmail.com>',
           'description': 'Quantifying Central Bank Policy Uncertainty in a '
                          'Highly Dollarized Economy: A Topic Modeling '
                          'Approach',
           'homepage': 'https://nbcpu.entelecheia.ai',
           'license': 'MIT',
           'name': 'Measuring Central Bank Policy Uncertainty'},
 'debug_mode': False,
 'dryrun': True,
 'fetcher': {'_config_group_': '/fetcher',
             '_config_name_': 'khmer_all',
             '_target_': 'nbcpu.fetcher.khmer.KhmerFetcher',
             'article_filename': 'articles.jsonl',
             'delay_between_requests': 0.0,
             'key_field': 'url',
             'link_filename': 'links.jsonl',
             'max_num_articles': None,
             'max_num_pages': None,
             'num_workers': 2,
             'output_dir': 'workspace/datasets/fetcher/khmer',
             'overwrite_existi

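To crawl the news articles, run the `nbcpu` workflow. The command below restricts the crawl to one results page and five articles for the search keyword `NBC`: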


In [21]:
!nbcpu +workflow=nbcpu tasks='[khmer_all]' \
    khmer_all.max_num_pages=1 khmer_all.max_num_articles=5 \
    khmer_all.search_keywords='[NBC]' \
    mode=__info__

[2023-08-15 14:50:35,465][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 14:50:36,516][nbcpu.fetcher.base][INFO] - Fetching links for keyword: NBC
[2023-08-15 14:50:36,517][nbcpu.fetcher.base][INFO] - [Keyword: NBC] Page: 1
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - Title: NBC to increase reserve requirements in foreign currency to 12.5%
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: NBC inks deal with UnionPay to expand cross-border payment to China
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501322478/nbc-inks-deal-with-unionpay-to-expand-cross-border-payment-to-china/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: Rural credit institutions help people improve livelihoods, NBC says
[2023-08-15 14:50:36,904][nbcpu

### Output Structure

Crawled articles are stored in a JSON Lines (`.jsonl`) file, where each line is a JSON object with the following fields:

- `title`: Title of the article
- `url`: URL of the article
- `keyword`: Keyword for which the article was found
- `categories`: Categories of the article
- `time`: Timestamp of the article
- `text`: Text of the article
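
For readers without HyFI installed, the line-per-record layout can be read and written with the standard library alone. The sketch below is an illustration of the JSON Lines format, not the loader the project uses; `write_jsonl` and `read_jsonl` are hypothetical helper names.

```python
import json
from typing import Any, Dict, Iterable, List


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Because each record sits on its own line, new articles can be appended without rewriting the file, and a partially written file still yields every complete line.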

Example data can be loaded and printed as follows:


In [25]:
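# Assumed import path for HyFI; adjust if it is exposed elsewhere in your setup.
from hyfi import HyFI

data = HyFI.load_jsonl(
    "/home/yjlee/workspace/projects/nbcpu/workspace/datasets/fetcher/khmer/articles.jsonl"
)
print(f"Number of articles: {len(data)}")
data[0]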


Number of articles: 10


{'title': 'NBC to increase reserve requirements in foreign currency to 12.5%',
 'url': 'https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/',
 'keyword': 'NBC',
 'categories': ['Business'],
 'time': '2023-08-02T07:18:54+07:00',
 'text': 'The National Bank of Cambodia (NBC) will increase the reserve requirements in foreign currency, especially US dollars of banks and financial institutions in the country to 12.5 percent in 2024 after this monetary policy instrument has been raised to nine percent since January 1, 2023 from seven percent during the pre-pandemic period, said an NBC report.\nHowever, the Semi-Annual Report 2023 released on Monday by NBC—Cambodia’s central bank and monetary authority—pointed out that the reserve requirements in riel would be kept unchanged at seven percent to encourage consumers to use the national currency more in the economy through higher possibility in releasing loans in riel to businesses and individ