ISSN: 2182-2069 (printed) / ISSN: 2182-2077 (online)
Understanding the Effectiveness of SBOM Generation Tools for Manually Installed Packages in Docker Containers
A Software Bill of Materials (SBOM), a standardized, machine-readable list of the components included in a piece of software, is a key technology for addressing software supply chain attacks. Since Docker containers, now prevalent for software distribution and deployment, typically consist of hundreds of packages, the use of automation tools to generate their SBOMs is recommended. Several OSS-based SBOM generation tools are currently available and play indispensable roles in automating SBOM utilization. In general, these tools draw on information from several package managers and on databases of popular software to create SBOMs from container images. However, some Docker containers include packages that their authors downloaded and installed manually, without a package manager. Despite this, few studies have examined how pervasive manually installed packages are or how accurately SBOM generation tools identify them. To investigate the issue, we collected 3500+ popular Docker container images from Docker Hub and assessed the accuracy of the SBOMs generated by two prominent OSS tools. The results showed 3000+ manual installations of 800+ packages that were either downloaded with Linux commands or copied directly from host systems. Of the containers, 51% included one or more manually installed packages. We found that SBOM tools can overlook 30-70% of these installations, which include both recent and outdated versions of major software as well as many niche or specialized tools. In addition, at least 27.7% of the manually installed packages were executed or read under the default settings of the Docker containers, and neither tool identified 22.7% of them, including packages with known CVE vulnerabilities. Finally, the results revealed that at least 1.1% of the installations are overlooked by the generators even though they are actively used and associated with known vulnerabilities.
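To make the two categories of manual installation concrete (packages downloaded with Linux commands and packages copied from the host), the following is a minimal illustrative sketch in Python. It is not the detection method used in this study, whose rules are not described in the abstract. The tool lists `DOWNLOAD_TOOLS` and `PACKAGE_MANAGERS` and the function names are hypothetical, and the heuristic only inspects the first word of each shell command in a Dockerfile-style instruction.

```python
import re

# Assumption: illustrative tool lists, not taken from the study.
DOWNLOAD_TOOLS = {"curl", "wget"}
PACKAGE_MANAGERS = {"apt-get", "apt", "apk", "yum", "dnf", "pip", "pip3", "npm", "gem"}


def classify_instruction(line):
    """Return the set of labels ('copy', 'download', 'package_manager')
    that a single Dockerfile-style instruction appears to match."""
    parts = line.strip().split(None, 1)
    if not parts:
        return set()
    keyword = parts[0].upper()
    body = parts[1] if len(parts) > 1 else ""
    if keyword in ("COPY", "ADD"):
        # Files brought in from the build context (or a URL, for ADD).
        return {"copy"}
    if keyword != "RUN":
        return set()
    labels = set()
    # Split the shell line into individual commands on &&, ||, ; and |.
    for command in re.split(r"&&|\|\||;|\|", body):
        words = command.split()
        if not words:
            continue
        if words[0] in DOWNLOAD_TOOLS or words[:2] == ["git", "clone"]:
            labels.add("download")
        elif words[0] in PACKAGE_MANAGERS:
            labels.add("package_manager")
    return labels


def is_manual_candidate(line):
    """True if the instruction may introduce software outside a package manager."""
    return bool(classify_instruction(line) & {"copy", "download"})
```

A heuristic of this kind only flags candidate instructions. Deciding whether a flagged instruction actually installs a package, and which package and version, would require inspecting the downloaded or copied files themselves.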