Crawling Khmer Times

Overview

The Khmer Times dataset is built in three steps: scraping the site's search results for article links, extracting the text of each linked article, and serializing the results. This section explains each step and then walks through the workflow used to crawl the news articles from the Khmer Times.

1. Web Scraping

1.1 Structure and Approach

The web scraping is performed using Python’s requests and BeautifulSoup libraries. The search results pages of the Khmer Times website are crawled to extract the URLs of the news articles.

  • Tags and Classes: Each search result is an <article> tag whose class attribute is item item-media. The title is in an <h2> tag with the class item-title, and the article URL is the href of the <a> tag inside that <h2> tag.

  • Functionality: A Python function iteratively crawls the search results pages using a search keyword until a 404 error is encountered, indicating no more results.

  • Output: The title and URL of each article are extracted and stored in a list of dictionaries.
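The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual fetcher: the function names parse_search_page and crawl_search_results are hypothetical, the URL template is the search_url value from the configuration printed later in this section, and URL-encoding the keyword with quote_plus is an assumption.

```python
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

# URL template as it appears in the fetcher configuration (search_url).
SEARCH_URL = "https://www.khmertimeskh.com/page/{page}/?s={keyword}"


def parse_search_page(html):
    """Return a list of {'title', 'url'} dicts found in one search results page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Each result is an <article class="item item-media"> element.
    for article in soup.select("article.item.item-media"):
        # The link sits inside <h2 class="item-title">.
        anchor = article.select_one("h2.item-title a")
        if anchor is None or not anchor.get("href"):
            continue  # skip malformed entries
        links.append({"title": anchor.get_text(strip=True), "url": anchor["href"]})
    return links


def crawl_search_results(keyword, start_page=1, max_num_pages=None):
    """Hypothetical helper: page through results until a 404 or the page limit."""
    links = []
    page = start_page
    while max_num_pages is None or page < start_page + max_num_pages:
        url = SEARCH_URL.format(page=page, keyword=quote_plus(keyword))
        response = requests.get(url, timeout=30)
        if response.status_code == 404:
            break  # a 404 means there are no more result pages
        response.raise_for_status()
        links.extend(parse_search_page(response.text))
        page += 1
    return links
```

Keeping the parsing separate from the HTTP loop makes the selectors easy to check against a saved HTML page without touching the network.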

2. Article Text Extraction

2.1 Structure and Approach

A separate Python function follows the URLs extracted from the previous step to scrape the text of the articles.

  • Tags and Classes: The text is found within <p> tags inside a <div> tag with the class entry-content. Categories are within <a> tags in a <div> with the class entry-meta, and the publication time is in a <time> tag within the same div.

  • Functionality: The function iterates over the URLs, sending a GET request to each and parsing the HTML to extract the text, categories, and publication time.

  • Output: The extracted data is stored in a list of dictionaries.
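A minimal sketch of the extraction step, using the tags and classes listed above. The function name parse_article is hypothetical, and the sketch assumes that the <time> tag carries an ISO 8601 datetime attribute, which the source does not state.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def parse_article(html):
    """Extract text, categories, and publication time from one article page."""
    soup = BeautifulSoup(html, "html.parser")

    # Body paragraphs live in <div class="entry-content">.
    content = soup.find("div", class_="entry-content")
    paragraphs = content.find_all("p") if content else []
    text = "\n".join(p.get_text() for p in paragraphs)

    # Categories and the <time> tag share <div class="entry-meta">.
    meta = soup.find("div", class_="entry-meta")
    categories = [a.get_text(strip=True) for a in meta.find_all("a")] if meta else []
    time_tag = meta.find("time") if meta else None

    # Assumption: the <time> tag has an ISO 8601 "datetime" attribute.
    published = None
    if time_tag is not None and time_tag.get("datetime"):
        published = datetime.fromisoformat(time_tag["datetime"])

    return {"text": text, "categories": categories, "time": published}
```

In a crawl, the HTML for each URL would come from a GET request, for example requests.get(url, timeout=30).text, and the returned dictionaries would be appended to a list.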

3. Data Serialization

The list of dictionaries is serialized to a JSON file using Python’s json library.

  • Datetime Conversion: Publication times are converted from datetime objects to ISO 8601 formatted strings, because datetime objects are not JSON serializable.

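  • JSON Structure: Each record is a JSON object with fields for text, categories, and time. In the crawling workflow described below, records are written one per line to a jsonl file and also carry the title, url, and keyword fields.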

This JSON file serves as the dataset for the project.
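The datetime conversion can be illustrated with the default hook of json.dump. This is a sketch under assumptions: the helper names encode_datetime and save_json and the sample record are made up for illustration and are not taken from the project code.

```python
import json
from datetime import datetime, timedelta, timezone


def encode_datetime(value):
    """Fallback for json.dump(s): turn datetime objects into ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def save_json(records, path):
    """Write a list of dictionaries to a JSON file as an array of objects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, default=encode_datetime, ensure_ascii=False, indent=2)


# Illustrative record only; real records come from the extraction step.
records = [
    {
        "text": "Example article text.",
        "categories": ["Business"],
        "time": datetime(2023, 8, 2, 7, 18, 54, tzinfo=timezone(timedelta(hours=7))),
    }
]
serialized = json.dumps(records, default=encode_datetime, ensure_ascii=False)
```

Without the default hook, json.dumps raises a TypeError as soon as it reaches the datetime value.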

Crawling Workflow

Configuration and Commands

The crawling configuration is located in the src/nbcpu/conf/fetcher directory. To print the resolved configuration without running the crawl, pass dryrun=true:

!nbcpu +fetcher=khmer_all dryrun=true
## Command Line Interface for HyFI ##
{'about': {'authors': 'Young Joon Lee <entelecheia@hotmail.com>',
           'description': 'Quantifying Central Bank Policy Uncertainty in a '
                          'Highly Dollarized Economy: A Topic Modeling '
                          'Approach',
           'homepage': 'https://nbcpu.entelecheia.ai',
           'license': 'MIT',
           'name': 'Measuring Central Bank Policy Uncertainty'},
 'debug_mode': False,
 'dryrun': True,
 'fetcher': {'_config_group_': '/fetcher',
             '_config_name_': 'khmer_all',
             '_target_': 'nbcpu.fetcher.khmer.KhmerFetcher',
             'article_filename': 'articles.jsonl',
             'delay_between_requests': 0.0,
             'key_field': 'url',
             'link_filename': 'links.jsonl',
             'max_num_articles': None,
             'max_num_pages': None,
             'num_workers': 2,
             'output_dir': 'workspace/datasets/fetcher/khmer',
             'overwrite_existing': False,
             'print_every': 10,
             'search_keywords': ['NBC',
                                 'Exchange Rate',
                                 'De-dollarization',
                                 'Inflation',
                                 'GDP',
                                 'Monetary Policy',
                                 'Finance',
                                 'Banking',
                                 'Stock Exchange',
                                 'Uncertain',
                                 'Economic',
                                 'Policy',
                                 'Financial',
                                 'Riel',
                                 'Bank',
                                 'Economy',
                                 'Securities Exchange',
                                 'National Bank of Cambodia'],
             'search_url': 'https://www.khmertimeskh.com/page/{page}/?s={keyword}',
             'start_page': 1,
             'verbose': True},
 'hydra_log_dir': '/home/yjlee/.hyfi/logs/hydra',
 'ignore_warnings': True,
 'logging_level': 'WARNING',
 'noop': False,
 'resolve': True,
 'verbose': False,
 'version': '0.12.0'}

Dryrun is enabled, not running the HyFI config
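The search_url entry in the configuration is a template with {page} and {keyword} placeholders. As an illustration only, and assuming the fetcher fills the template with str.format and URL-encodes the keyword (the source does not show this), the request URLs could be generated as follows; build_search_urls is a hypothetical helper.

```python
from urllib.parse import quote_plus

# Template copied from the search_url entry of the printed configuration.
SEARCH_URL = "https://www.khmertimeskh.com/page/{page}/?s={keyword}"


def build_search_urls(keywords, start_page=1, num_pages=2):
    """Hypothetical helper: expand the template for each keyword and page."""
    urls = []
    for keyword in keywords:
        for page in range(start_page, start_page + num_pages):
            # quote_plus encodes spaces so multi-word keywords form a valid query.
            urls.append(SEARCH_URL.format(page=page, keyword=quote_plus(keyword)))
    return urls


for url in build_search_urls(["NBC", "Exchange Rate"]):
    print(url)
```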

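To crawl the news articles, run the workflow. The example below restricts the crawl to the keyword NBC, one results page, and max_num_articles=5. The log shows two scraping batches of five articles each (the configuration sets num_workers to 2), so ten articles are saved: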

!nbcpu +workflow=nbcpu tasks='[khmer_all]' \
    khmer_all.max_num_pages=1 khmer_all.max_num_articles=5 \
    khmer_all.search_keywords='[NBC]' \
    mode=__info__
[2023-08-15 14:50:35,465][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 14:50:36,516][nbcpu.fetcher.base][INFO] - Fetching links for keyword: NBC
[2023-08-15 14:50:36,517][nbcpu.fetcher.base][INFO] - [Keyword: NBC] Page: 1
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - Title: NBC to increase reserve requirements in foreign currency to 12.5%
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: NBC inks deal with UnionPay to expand cross-border payment to China
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501322478/nbc-inks-deal-with-unionpay-to-expand-cross-border-payment-to-china/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: Rural credit institutions help people improve livelihoods, NBC says
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501287567/rural-credit-institutions-help-people-improve-livelihoods-nbc-says/
[2023-08-15 14:50:36,907][nbcpu.fetcher.base][INFO] - Reached max number of pages, stopping...
[2023-08-15 14:50:36,907][nbcpu.fetcher.base][INFO] - Finished fetching links for keyword: NBC
[2023-08-15 14:50:36,907][nbcpu.fetcher.base][INFO] - Total links fetched: 30
[2023-08-15 14:50:36,915][nbcpu.fetcher.base][INFO] - Removed 0 duplicate links from 30 links
[2023-08-15 14:50:36,916][nbcpu.fetcher.base][INFO] - Saved 30 links to /home/yjlee/workspace/projects/nbcpu/workspace/datasets/fetcher/khmer/links.jsonl
[2023-08-15 14:50:40,769][nbcpu.fetcher.base][INFO] - Reached max number of articles, stopping...
[2023-08-15 14:50:40,769][nbcpu.fetcher.base][INFO] - Finished scraping articles
[2023-08-15 14:50:40,769][nbcpu.fetcher.base][INFO] - Total articles scraped: 5
[2023-08-15 14:50:40,853][nbcpu.fetcher.base][INFO] - Reached max number of articles, stopping...
[2023-08-15 14:50:40,853][nbcpu.fetcher.base][INFO] - Finished scraping articles
[2023-08-15 14:50:40,853][nbcpu.fetcher.base][INFO] - Total articles scraped: 5
[2023-08-15 14:50:40,867][nbcpu.fetcher.base][INFO] - Removed 0 duplicate articles from 10 articles
[2023-08-15 14:50:40,868][nbcpu.fetcher.base][INFO] - Saved 10 articles to /home/yjlee/workspace/projects/nbcpu/workspace/datasets/fetcher/khmer/articles.jsonl
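The log reports a duplicate-removal pass for both links and articles, and the configuration sets key_field to url. With overlapping search keywords, the same article can appear under several searches. A minimal sketch of key-based de-duplication follows, assuming the first record for each key is kept (the fetcher's actual rule is not shown in the source); remove_duplicates and the sample links are hypothetical.

```python
def remove_duplicates(records, key_field="url"):
    """Keep the first record seen for each value of key_field, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = record.get(key_field)
        if key in seen:
            continue  # a record with this key was already kept
        seen.add(key)
        unique.append(record)
    return unique


# Illustrative data: the same URL found under two different keywords.
links = [
    {"url": "https://example.com/a", "keyword": "NBC"},
    {"url": "https://example.com/b", "keyword": "NBC"},
    {"url": "https://example.com/a", "keyword": "Bank"},
]
unique_links = remove_duplicates(links)
print(f"Removed {len(links) - len(unique_links)} duplicate links from {len(links)} links")
```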

Output Structure

Crawled articles are stored in a jsonl file, with each line representing a JSON object containing:

  • title: Title of the article

  • url: URL of the article

  • keyword: Keyword for which the article was found

  • categories: Categories of the article

  • time: Timestamp of the article

  • text: Text of the article

Example data can be loaded and printed as follows:

data = HyFI.load_jsonl(
    "/home/yjlee/workspace/projects/nbcpu/workspace/datasets/fetcher/khmer/articles.jsonl"
)
print(f"Number of articles: {len(data)}")
data[0]
Number of articles: 10
{'title': 'NBC to increase reserve requirements in foreign currency to 12.5%',
 'url': 'https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/',
 'keyword': 'NBC',
 'categories': ['Business'],
 'time': '2023-08-02T07:18:54+07:00',
 'text': 'The National Bank of Cambodia (NBC) will increase the reserve requirements in foreign currency, especially US dollars of banks and financial institutions in the country to 12.5 percent in 2024 after this monetary policy instrument has been raised to nine percent since January 1, 2023 from seven percent during the pre-pandemic period, said an NBC report.\nHowever, the Semi-Annual Report 2023 released on Monday by NBC—Cambodia’s central bank and monetary authority—pointed out that the reserve requirements in riel would be kept unchanged at seven percent to encourage consumers to use the national currency more in the economy through higher possibility in releasing loans in riel to businesses and individuals.\nChea Chanto, former Governor of NBC, said in the report that NBC has increased the reserve requirements—a part of deposits that all banks and financial institutions are required to keep to be ready in response to any case that a number of depositors withdraw cash from the bank in a remarkable speed and/or at the same—thanks to improved economic activities in the country.\n“In participation in maintaining the price stability and boosting the economic growth of the country, monetary policies have been implemented very flexibly and prudentially. 
Liquidity has been released into the market and then absorbed back in response to the actual demand to ensure that banks and financial institutions have sufficient liquidity,” Chanto said.\n“The sufficient liquidity will enable those institutions to provide loans to consumers at a more reasonable interest rate and take pressures away from the exchange rate,” said Chanto, adding that improved economic activities and effective monetary instruments have increased the demand for, leading to no selling-buying intervention for US dollar by NBC in the first half of 2023.\nLiquidity means that it is as high as possible for consumers to withdraw deposits back when they are in need of use in their consumption and banks and financial institutions to have on hand to be ready in response to the demand for loans or withdrawal from consumers, which can be called the strong resilience of banks and financial institutions against shocks or crises.\nThe report further pointed out that the increase in the reserve requirements in foreign currency has absorbed an amount of foreign-currency liquidity from banks and financial institutions, but released the liquidity by approximately $1.7 billion in the first half of this year compared to the pre-crisis period when the rates were set at 12.5 percent and eight percent for reserve requirements in dollar and riel respectively.\nThe report released after NBC’s semi-annual meeting indicated that the reserve requirements in US dollar and riel have been about $4.3 billion and $340.4 million respectively in the first half of 2023, which has also increased 32 percent and 19 percent respectively compared to the same period of the previous year in line with the rising deposits and overseas loans.\n“If riel is used more and more broadly, the implementation of the central bank’s monetary policies would be more and more effective in further boosting economic activities. 
However, the participation from all stakeholders including public institutions, private firms and the people is still the key to success in expanding the use of riel in all sectors,” Chanto added.\nSpeaking to Khmer Times yesterday, In Channy, President & Group Managing Director of Acleda Bank Plc, said that NBC had decreased the reserve requirements in foreign currency to seven percent during Covid-19 pandemic from 12.5 percent before the health crisis, but then increased to nine percent since early this year and will raise the rate to 12.5 percent next year.\n“It is an incentive for riel as the central bank has implemented the monetary policy by having kept the reserve requirements in national currency unchanged, which would enable banks and financial institutions to increase loan release in riel that will be used more and more in the economy. It is not about being fair or unfair. It is about policy implementation,” said Channy.\nCambodia’s banking system consists of 59 commercial banks, nine specialised banks, five deposit-taking microfinance institutions (MDIs), 82 microfinance institutions (MFIs), 118 leasing firms, 35 payment institutions, one credit information-sharing service providers, six representative offices of foreign banks and 2,890 money exchangers, according to the report.\n\xa0'}
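If HyFI is not available, the same file can be read with the standard library. The sketch below assumes one JSON object per line, as described above, and parses the time string back into a timezone-aware datetime. The load_jsonl function here is a stand-alone illustration, not HyFI's implementation, and parse_time is a hypothetical helper.

```python
import json
from datetime import datetime


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def parse_time(record):
    """Convert the ISO 8601 'time' string of a record into a datetime."""
    return datetime.fromisoformat(record["time"])
```

Parsing the timestamps makes it straightforward to sort or filter articles by publication date before further processing.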