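In 1996, the U.S. Securities and Exchange Commission (SEC) required all disclosure filers (U.S. public companies) to file electronically. EDGAR, short for Electronic Data Gathering, Analysis, and Retrieval, is the system built for this purpose: it collects, analyzes, and provides access to electronically filed disclosures. The SEC's official description of EDGAR states that the system is meant to make filing easier for electronic filers, to speed up and streamline the SEC's processing of information, and to give investors, financial institutions, and others timely access to market information. Filers must register with EDGAR and disclose a range of information through it, including periodic financial reports (such as the quarterly Form 10-Q and the annual Form 10-K) and other required filings.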
1.2 The EDGAR System in Research
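Research shows that, as an exogenous information-technology shock, the implementation of EDGAR substantially lowered firms' disclosure costs, reduced information asymmetry between investors and firms, increased stock liquidity, lowered the cost of equity, and raised the level of equity financing, thereby improving firms' operating performance (Goldstein, Yang and Zuo, 2023; Gomez, 2023). In addition, Ni, Wang and Yin (2021) find that the implementation of EDGAR increased both stock liquidity and investors' reliance on public disclosure, which in turn strengthened managers' incentives to withhold bad news.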
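The SEC provides detailed documentation of its API, which interested readers can find at https://www.sec.gov/edgar/sec-api-documentation. For reasons of space, this post covers only a few basic examples of retrieving EDGAR data with Python. More complete, one-stop download methods are available through the links in the references.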
2.1 Retrieving CIK Information
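When a filer registers with the SEC and submits filings, EDGAR assigns it a unique 10-digit numeric identifier called the Central Index Key (CIK). A CIK is unique to its filer and is never recycled. The following Python code retrieves companies' CIK information.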
# import modules
import requests
import pandas as pd

# create request header
# please input your own email address
headers = {'User-Agent': "hw2258@bath.ac.uk"}

# get all companies data
companyTickers = requests.get(
    "https://www.sec.gov/files/company_tickers.json",
    headers=headers
)

# dictionary to dataframe
companyData = pd.DataFrame.from_dict(companyTickers.json(), orient='index')

# add leading zeros to CIK
companyData['cik_str'] = companyData['cik_str'].astype(str).str.zfill(10)
2.2 Overview of Filings
The following code retrieves the metadata of Apple's recent 10-K filings.
# look up Apple's zero-padded CIK; .iloc[0] selects the first match by
# position, since the DataFrame index holds string labels rather than integers
CIK = companyData[companyData['ticker'] == "AAPL"]['cik_str'].iloc[0]

# get company specific filing metadata
filings = requests.get(
    f'https://data.sec.gov/submissions/CIK{CIK}.json',
    headers=headers
)

# dictionary to dataframe
filingsForms = pd.DataFrame.from_dict(
    filings.json()['filings']['recent']
)

# filter only Annual reports
annualForms = filingsForms[filingsForms['form'] == '10-K']
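Once the 10-K rows are identified, a natural next step is to turn each row into a link to the filed document. The sketch below is one way to do that, under two assumptions that do not come from the code above: that the `recent` block contains `accessionNumber` and `primaryDocument` columns alongside `form`, and that filed documents sit under `https://www.sec.gov/Archives/edgar/data/<CIK without leading zeros>/<accession number without dashes>/`. The helper names `filing_url` and `annual_report_urls` are hypothetical, and the sample values are placeholders rather than real filings.

```python
ARCHIVES_ROOT = "https://www.sec.gov/Archives/edgar/data"


def filing_url(cik: str, accession_number: str, primary_document: str) -> str:
    """Build the (assumed) archive URL of a filing's primary document."""
    # The archive path is assumed to use the CIK without leading zeros
    # and the accession number without dashes.
    cik_no_zeros = str(int(cik))
    accession_no_dashes = accession_number.replace("-", "")
    return f"{ARCHIVES_ROOT}/{cik_no_zeros}/{accession_no_dashes}/{primary_document}"


def annual_report_urls(cik: str, recent: dict) -> list:
    """Return primary-document URLs for the 10-K rows of a 'recent' block."""
    # 'recent' is columnar: one list per field, aligned by position.
    rows = zip(recent["form"], recent["accessionNumber"], recent["primaryDocument"])
    return [filing_url(cik, acc, doc) for form, acc, doc in rows if form == "10-K"]


# Placeholder data shaped like filings.json()['filings']['recent'].
sample_recent = {
    "accessionNumber": ["0000000000-24-000001", "0000000000-24-000002"],
    "form": ["10-K", "8-K"],
    "primaryDocument": ["annual.htm", "current.htm"],
}

urls = annual_report_urls("0000000123", sample_recent)
print(urls)
```

With real data, `annual_report_urls(CIK, filings.json()['filings']['recent'])` would play the same role, provided the assumed columns are present in the response.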
2.3 XBRL Data API
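The eXtensible Business Reporting Language (XBRL) is an XML-based format for reporting financial statements, used by the SEC and by financial regulators around the world. Readers who have interned in the operations or compliance department of a fund management company will have heard the term XBRL often.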
XBRL includes Concepts, Taxonomies, Values, Contexts, Facts, Instances and Dimensions. By combining a concept (profit) from a taxonomy (say Canadian GAAP) with a value (1000) and the needed context (Acme Corporation, for the period 1 January 2015 to 31 January 2015 in Canadian Dollars) we arrive at a fact. Collections of facts in XBRL are contained in documents called instances.
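To make the vocabulary concrete, the sketch below models the example from the paragraph above with plain Python dataclasses. The class and field names (`Context`, `Fact`, `period_start`, and so on) are hypothetical and chosen only for illustration. They are not part of the XBRL specification or of any XBRL library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Context:
    """Who the value belongs to, for which period, and in which unit."""
    entity: str
    period_start: date
    period_end: date
    unit: str


@dataclass(frozen=True)
class Fact:
    """A concept from a taxonomy, combined with a value and its context."""
    taxonomy: str
    concept: str
    value: float
    context: Context


# The example from the text: profit of 1000 Canadian dollars for
# Acme Corporation, 1 January 2015 to 31 January 2015, under Canadian GAAP.
context = Context("Acme Corporation", date(2015, 1, 1), date(2015, 1, 31), "CAD")
fact = Fact("Canadian GAAP", "profit", 1000, context)

# An instance is a collection of facts.
instance = [fact]
print(fact.concept, fact.value, fact.context.unit)
```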
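Of course, not knowing these details does not prevent us from retrieving the data.
The following Python code retrieves some financial information.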
# get company facts data
# companyfacts API returns all the company concepts data for a company into a single API call
companyFacts = requests.get(
    f'https://data.sec.gov/api/xbrl/companyfacts/CIK{CIK}.json',
    headers=headers
)

# get the current assets values
curr_assets_df_1 = pd.DataFrame(
    companyFacts.json()["facts"]["us-gaap"]["AssetsCurrent"]["units"]["USD"]
)
# get company concept data
# The company-concept API returns all the XBRL disclosures from a single
# company (CIK) and concept (a taxonomy and tag) in a single JSON file, with
# a separate array of facts for each unit of measure that the company has
# chosen to disclose (e.g. net profits reported in U.S. dollars and in
# Canadian dollars).
companyConcept = requests.get(
    (
        f'https://data.sec.gov/api/xbrl/companyconcept/CIK{CIK}'
        f'/us-gaap/AssetsCurrent.json'
    ),
    headers=headers
)
curr_assets_df_2 = pd.DataFrame(companyConcept.json()["units"]["USD"])
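The list under `["units"]["USD"]` typically mixes values from annual and quarterly filings, and the same period end can appear in more than one filing when a later report repeats it as a comparative figure. The sketch below shows one possible way to reduce such a list to one annual value per period end. It assumes each entry carries `end`, `val`, `form`, and `filed` keys; the function name `latest_annual_values` is hypothetical, and the sample numbers are made up for illustration.

```python
def latest_annual_values(units: list) -> list:
    """Keep one 10-K value per period end, preferring the latest filing."""
    latest = {}
    for fact in units:
        if fact["form"] != "10-K":
            continue
        seen = latest.get(fact["end"])
        # Dates are assumed to be ISO strings, which compare correctly as text.
        if seen is None or fact["filed"] > seen["filed"]:
            latest[fact["end"]] = fact
    # Return the surviving facts ordered by period end.
    return [latest[end] for end in sorted(latest)]


# Made-up entries shaped like companyConcept.json()["units"]["USD"].
sample_units = [
    {"end": "2021-12-31", "val": 100, "form": "10-K", "filed": "2022-02-15"},
    {"end": "2021-12-31", "val": 100, "form": "10-K", "filed": "2023-02-15"},
    {"end": "2022-03-31", "val": 110, "form": "10-Q", "filed": "2022-05-01"},
    {"end": "2022-12-31", "val": 120, "form": "10-K", "filed": "2023-02-15"},
]

annual_assets = latest_annual_values(sample_units)
for fact in annual_assets:
    print(fact["end"], fact["val"], fact["filed"])
```

Keeping the most recently filed value is a design choice rather than a rule: it favors restated figures, whereas a study interested in what investors saw at the time would keep the earliest filing instead.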
2.4 Downloading Filing Documents
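So far we have covered how to retrieve information through the EDGAR Data API. In actual research, however, the most common approach is probably still to crawl the EDGAR website with Python, download the filings, and parse the data. Note that the SEC limits access to EDGAR data to 10 requests per second so that every client gets fair access. Because the code for this part is long, readers can obtain the relevant files through the links in the references.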
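A crawler can stay under the 10-requests-per-second limit by spacing its requests. The sketch below is a minimal, single-threaded throttle, not the method used in the referenced projects. The class name `RateLimiter` and its parameters are hypothetical; the clock and sleep functions are injectable so the behavior can be checked with a simulated clock instead of real waiting.

```python
import time


class RateLimiter:
    """Space calls so that at most `max_per_second` start per second."""

    def __init__(self, max_per_second=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()
        self.next_allowed = now + self.min_interval


class FakeClock:
    """Simulated clock so the demo finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
limiter = RateLimiter(max_per_second=10, clock=fake.time, sleep=fake.sleep)
for _ in range(25):
    limiter.wait()  # a real crawler would issue its HTTP request right after this
print(f"25 calls took {fake.now:.1f} simulated seconds")
```

In a real crawler, `RateLimiter()` with its default arguments would be created once and `wait()` called before every request. A lower `max_per_second` leaves some headroom below the limit.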
3. References
Azimi, M., & Agrawal, A. (2021). Is Positive Sentiment in Corporate Annual Reports Informative? Evidence from Deep Learning. The Review of Asset Pricing Studies, 11(4), 762–805.
Bodnaruk, A., Loughran, T., & McDonald, B. (2015). Using 10-K Text to Gauge Financial Constraints. Journal of Financial and Quantitative Analysis, 50(4), 623–646.
Gao, M., & Huang, J. (2019). Informing the Market: The Effect of Modern Information Technologies on Information Production. The Review of Financial Studies, 33(4), 1367–1411.
Goldstein, I., Yang, S., & Zuo, L. (2023). The Real Effects of Modern Information Technologies: Evidence from the EDGAR Implementation. Journal of Accounting Research, 61(5), 1699–1733.
Gomez, E. A. (2023). The Effect of Mandatory Disclosure Dissemination on Information Asymmetry among Investors: Evidence from the Implementation of the EDGAR System. The Accounting Review, 1–23.
Griffin, P. A. (2003). Got Information? Investor Response to Form 10-K and Form 10-Q EDGAR Filings. Review of Accounting Studies, 8, 433–460.
Muslu, V., Radhakrishnan, S., Subramanyam, K. R., & Lim, D. (2015). Forward-Looking MD&A Disclosures and the Information Environment. Management Science, 61(5), 931–948.
Ni, X., Wang, Y., & Yin, D. (2021). Does Modern Information Technology Attenuate Managerial Information Hoarding? Evidence from the EDGAR Implementation. Journal of Corporate Finance, 71, 102100.
Python Workshop on Web Data Extraction.
EDGAR-CRAWLER: Unlock the Power of Financial Documents. Github: https://github.com/nlpaueb/edgar-crawler.
EDGAR-CORPUS on Zenodo. EDGAR-CORPUS: The biggest corpus for financial NLP research, built from EDGAR-CRAWLER https://zenodo.org/record/5528490
EDGAR-CORPUS on HuggingFace datasets. https://huggingface.co/datasets/eloukas/edgar-corpus/
Financial Word2Vec Embeddings. EDGAR-W2V: Word2vec Embeddings trained on EDGAR-CORPUS. https://zenodo.org/record/5524358