Lab: Crawling DART Data#
In this lab, we will learn how to crawl financial data from the DART system (Data Analysis, Retrieval, and Transfer System) provided by the Financial Supervisory Service of South Korea. We will use the OpenDartReader library to access the Open DART API, which provides financial information about Korean companies. Additionally, we will extract specific sections from the financial statements using BeautifulSoup.
Preparation#
First, let’s set up the workspace and required directories:
import os, sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    from google.colab import drive
    if not os.path.exists('/content/drive'):
        drive.mount("/content/drive")
    !ln -s "/content/drive/My Drive/colab_workspace" workspace
    WORKSPACE_DIR = "/content/workspace/projects/dart"
else:
    WORKSPACE_DIR = "examples/dart"
print(f'WORKSPACE_DIR = {WORKSPACE_DIR}')
data_dir = os.path.join(WORKSPACE_DIR, "data")
os.makedirs(data_dir, exist_ok=True)
WORKSPACE_DIR = examples/dart
Introduction to OpenDartReader#
OpenDartReader is an open-source Python library that makes it easy to use the Open DART API. The Open DART API provides access to data from the Korean Financial Supervisory Service’s electronic disclosure system. The API is well designed and offers a range of useful features. However, using it directly can be cumbersome because it requires additional work, such as converting the received data into a more usable format like a pandas DataFrame.
OpenDartReader simplifies these tasks by providing a user-friendly interface for accessing the Open DART API. It also offers utility functions for downloading and processing attachments and documents, making it easy to retrieve financial statements in Excel format, for example.
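To make the "additional work" concrete, here is a minimal sketch of what calling the disclosure-search endpoint directly might involve, using only the standard library. The endpoint URL, the parameter names (crtfc_key, corp_code, bgn_de, end_de), the "000" success status, and the corporate code used for Samsung Electronics are assumptions based on the Open DART guide and should be checked against it. The helper names and the sample response body are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Disclosure-search endpoint of the Open DART API (assumed from the public guide).
LIST_URL = "https://opendart.fss.or.kr/api/list.json"


def build_list_url(api_key, corp_code, start, end):
    """Build the request URL; the raw API is assumed to expect dates as YYYYMMDD."""
    params = {
        "crtfc_key": api_key,
        "corp_code": corp_code,
        "bgn_de": start.replace("-", ""),
        "end_de": end.replace("-", ""),
    }
    return f"{LIST_URL}?{urlencode(params)}"


def parse_list_response(text):
    """Return the disclosure rows (a list of dicts) from a JSON response body."""
    payload = json.loads(text)
    # Status "000" is assumed to mean success; anything else is treated as an error.
    if payload.get("status") != "000":
        raise RuntimeError(
            f"Open DART error {payload.get('status')}: {payload.get('message')}"
        )
    return payload.get("list", [])


# Hypothetical response body, hand-written for illustration only.
sample_body = json.dumps({
    "status": "000",
    "message": "OK",
    "list": [
        {"corp_name": "삼성전자", "rcept_no": "20220816001711", "rcept_dt": "20220816"}
    ],
})

# "00126380" is assumed to be the corporate code of Samsung Electronics.
url = build_list_url("your_api_key_here", "00126380", "2022-01-01", "2022-12-31")
rows = parse_list_response(sample_body)
print(url)
print(rows[0]["rcept_no"])
```

Even after these steps, each row is still a plain dictionary. Converting the rows to a DataFrame and resolving company names to corporate codes are the kind of extra steps the library is meant to take care of.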
Installation#
To install OpenDartReader, simply run:
pip install opendartreader
If you need to upgrade an existing installation, use:
pip install --upgrade opendartreader
Quick Start#
First, import the OpenDartReader library and create an instance using your API key:
# %%capture
%pip install opendartreader python-dotenv
import OpenDartReader
api_key = "your_api_key_here"
dart = OpenDartReader(api_key)
Accessing Public Disclosure Information#
You can access the public disclosure information of a specific company (e.g., Samsung Electronics) using the list() method:
# All public disclosures for Samsung Electronics since its IPO
dart.list('삼성전자') # Use either the company name or the stock code (e.g., '005930')
You can also specify a date range for retrieving public disclosures:
# Public disclosures for Samsung Electronics after a specific date
dart.list('005930', start='2022-01-01') # Disclosures from 2022-01-01 to today
# Public disclosures for Samsung Electronics within a specific date range
dart.list('005930', start='2022-04-28', end='2022-04-28')
Additionally, you can retrieve specific types of disclosures, such as periodic reports (annual, semi-annual, and quarterly), by specifying the kind parameter:
# All periodic reports (including corrected ones) for Samsung Electronics since 1999
dart.list('005930', start='1999-01-01', kind='A', final=False)
# Only the final versions of periodic reports for Samsung Electronics since 1999
dart.list('005930', start='1999-01-01', kind='A')
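For reference, the kind argument takes a single-letter disclosure-type code. The lookup table below is a sketch of those codes as we understand them from the Open DART guide (the pblntf_ty request parameter). The English labels are our own rough translations, and the helper describe_kind is hypothetical, so verify the codes against the official documentation before relying on them.

```python
# Disclosure-type codes, assumed from the Open DART guide (`pblntf_ty`).
# The English labels are approximate translations, not official names.
DISCLOSURE_KINDS = {
    "A": "Periodic disclosures (annual, semi-annual, quarterly reports)",
    "B": "Material event reports",
    "C": "Securities issuance disclosures",
    "D": "Equity (shareholding) disclosures",
    "E": "Other disclosures",
    "F": "External audit related disclosures",
    "G": "Fund disclosures",
    "H": "Asset-backed securitization disclosures",
    "I": "Exchange disclosures",
    "J": "Fair Trade Commission disclosures",
}


def describe_kind(kind):
    """Return a readable label for a disclosure-type code (hypothetical helper)."""
    try:
        return DISCLOSURE_KINDS[kind.upper()]
    except KeyError:
        raise ValueError(f"Unknown disclosure kind: {kind!r}") from None


print(describe_kind("A"))
```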
Accessing Company Overview Information#
To obtain a company’s overview information, use the company() method:
# Overview information for Samsung Electronics
dart.company('005930')
You can also search for companies with a specific name using the company_by_name() method:
# Overview information for companies with "Samsung Electronics" in their name
dart.company_by_name('삼성전자')
Accessing Disclosure Documents#
To access the original disclosure document in XML format, use the document() method:
# Samsung Electronics' 2022 semi-annual business report in XML format
xml_text = dart.document('20220816001711')
To retrieve a list of all documents related to a specific disclosure, such as the business report and audit report, use the document_all() method:
xml_text_list = dart.document_all('20220816001711')
xml_text = xml_text_list[0]
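Because each call downloads the documents again, it can be convenient to keep a local copy. The following is a minimal sketch, assuming the documents arrive as a list of strings as in the example above. The helper name save_documents and the file-naming scheme are our own choices, not part of OpenDartReader.

```python
from pathlib import Path


def save_documents(xml_texts, rcept_no, out_dir):
    """Write each document to <out_dir>/<rcept_no>_<index>.xml and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, text in enumerate(xml_texts):
        target = out_path / f"{rcept_no}_{i}.xml"
        # Store as UTF-8 so Korean text survives the round trip.
        target.write_text(text, encoding="utf-8")
        saved.append(target)
    return saved


# Example usage (assumes xml_text_list and data_dir from the cells above):
# save_documents(xml_text_list, "20220816001711", data_dir)
```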
Finding Corporate Codes#
You can find a company’s unique corporate code using the find_corp_code() method:
# Find corporate code using the stock code
dart.find_corp_code('005930')
# Find corporate code using the company name
dart.find_corp_code('삼성전자')
Download and Extract Company Disclosure from DART#
First, load the DART API key from the .env file:
from dotenv import load_dotenv
# Build the path with os.path.join so it works on any operating system
dotenv_path = os.path.join(WORKSPACE_DIR, ".env")
load_dotenv(dotenv_path)
DART_API_KEY = os.environ.get("DART_API_KEY")
dart = OpenDartReader(DART_API_KEY)
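Note that os.environ.get() returns None when the variable is missing, so a typo in the .env file may only surface later as a less obvious API error. A small guard such as the following sketch fails early with a clear message. The helper name require_api_key is hypothetical.

```python
import os


def require_api_key(name="DART_API_KEY"):
    """Return the key stored in the environment variable, or fail with a clear message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to the .env file."
        )
    return key


# Example usage (after load_dotenv has run):
# dart = OpenDartReader(require_api_key())
```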
Next, retrieve a list of Samsung Electronics’ periodic reports (annual, semi-annual, and quarterly) since 1999:
dart.list("005930", start="1999-01-01", kind="A")
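Now, list the sub-documents (sections) of a specific report with the sub_docs() method. When the match argument is given, the results are sorted by how closely each section title matches the query:
# "이사의 경영진단 및 분석의견" is the Korean section title for Management's Discussion and Analysis (MD&A)
sub_docs = dart.sub_docs("20220308000798", match="이사의 경영진단 및 분석의견")
sub_docs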
Finally, obtain the URL for the most relevant document:
url = sub_docs.url[sub_docs.index[0]]
Crawling the MD&A Section from the Financial Statement#
To extract the Management’s Discussion and Analysis (MD&A) section from the financial statement, we will use the BeautifulSoup library. First, install the library:
pip install beautifulsoup4
Now, import the necessary libraries and fetch the target URL’s content:
%pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
soup.get_text()
This code retrieves the text content of the MD&A section from the financial statement, which can then be processed and analyzed as needed.
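Text extracted from HTML usually contains non-breaking spaces, stray indentation, and long runs of blank lines. As one possible first processing step, the sketch below normalizes whitespace using only the standard library. The helper name clean_text and the sample string are made up for illustration.

```python
import re


def clean_text(raw):
    """Normalize whitespace in text extracted from an HTML page."""
    # Replace non-breaking spaces, which are common in HTML-derived text.
    text = raw.replace("\xa0", " ")
    # Collapse runs of spaces and tabs within each line and trim line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines so the remaining lines are separated by single newlines.
    return "\n".join(line for line in lines if line)


# Hand-written sample standing in for the output of soup.get_text().
sample = "  1. 개요 \xa0\n\n\n  가. 영업실적\t(요약)  "
print(clean_text(sample))
```

With the page from the previous cell, this could be applied as clean_text(soup.get_text("\n")), where the separator argument keeps text from adjacent tags on separate lines.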