Uncovering Companies Missing from the SABI Database: A Web Scraping Approach

Authors

  • Xin-Hui Huang, Universitat Politècnica de València, Dept. Economics and Social Sciences, 46022 València (Spain)
  • Josep Domenech, Universitat Politècnica de València, Dept. Economics and Social Sciences, 46022 València (Spain)

DOI:

https://doi.org/10.3145/epi.2025.ene.34202

Abstract

This study evaluates the completeness and representativeness of the SABI database, a widely used commercial source for firm-level data in Spain and Portugal, by comparing it to BORME, the official Spanish business register. Using web scraping techniques, we collected and processed approximately 100,000 BORME publications in PDF format, covering the period from 2010 to 2023. These were transformed into a structured dataset comprising over 1.2 million companies, which we then matched against SABI records from the same period. Our analysis reveals that SABI covers only 38.3% of newly established companies, with significant underrepresentation of younger firms, small enterprises, specific sectors, and certain regions. Furthermore, we find clear evidence of survivorship bias: the longer a company has been dissolved, the less likely it is to appear in SABI. Sectoral and geographic disparities are also substantial, and the coverage is skewed toward firms with higher initial capital and specific legal forms. These findings suggest that SABI represents a non-random subset of the Spanish business population, and caution should be exercised when using it for empirical research. Adjustments for sample bias are recommended to improve the reliability of analyses based on this database.
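
As an illustration of the coverage comparison described in the abstract, the following minimal sketch computes, per group (for example, incorporation year), the share of register companies that also appear in a commercial database. It is not the authors' pipeline: the abstract does not state how records were matched, so matching on a normalized company name is an assumption, and `normalize_name`, `coverage_by_group`, and the sample records are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Hypothetical matching key: drop accents, punctuation, case and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", without_accents).upper()
    return " ".join(cleaned.split())


def coverage_by_group(register_rows, commercial_names):
    """Return {group: share of register companies found in the commercial source}.

    register_rows: iterable of (company_name, group) pairs from the register.
    commercial_names: iterable of company names from the commercial database.
    """
    known = {normalize_name(n) for n in commercial_names}
    totals = defaultdict(int)
    matched = defaultdict(int)
    for name, group in register_rows:
        totals[group] += 1
        if normalize_name(name) in known:
            matched[group] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Made-up sample records, for illustration only.
register = [
    ("Panadería López, S.L.", 2012),
    ("Talleres Ruiz SL", 2012),
    ("Nova Tech S.A.", 2020),
]
commercial = ["PANADERIA LOPEZ S.L."]
coverage = coverage_by_group(register, commercial)
```

Grouping by sector, region, or legal form instead of year would only require changing the second element of each register pair. Exact name matching is deliberately simple here; a real comparison would likely need a stable identifier or fuzzy matching.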

Published

2025-07-23

How to Cite

Huang, X.-H., & Domenech, J. (2025). Uncovering Companies Missing from the SABI Database: A Web Scraping Approach. Profesional de la información, 34(2). https://doi.org/10.3145/epi.2025.ene.34202

Section

Research articles