When working with repositories and blob stores, you may want insight into how storage space is being used.
This article provides a few different ways to find out where repository and blob store space is being consumed.
Using Queries on the OrientDB Component Backup to Report Asset Sizes
All of your asset metadata is stored in the OrientDB component database. Running queries against the live database while the Repository process is using it is strongly discouraged.
However, you can safely run queries against a recent component database backup that is not in use by other processes.
Report of Total Asset Record Sizes Per Repository
- Create a database backup using the Admin - Export databases for backup task
- Download a stand-alone OrientDB executable console jar to the host where the backups reside:
curl -O -kL https://sonatype.zendesk.com/hc/article_attachments/8137481277715/orient-console.jar
- Execute a query against the component db backup that will generate the report. Adjust the values as needed:
- replace 'maven-central' with the id of your repository
- replace the blob_updated value with your own date, or remove that criterion
- replace the last_downloaded value with your own date, or remove that criterion
- replace the extractDir value with a path where there is enough disk space to unpack the .bak zip file
- replace the exportPath value with a file location where you want the output generated
echo "SELECT bucket.repository_name as r, count(*) as asset_count, \
sum(size) as asset_size FROM asset WHERE bucket.repository_name = 'maven-central' \
AND blob_updated <= '2021-09-29' AND last_downloaded <= '2021-11-28' \
GROUP BY bucket" | java -Xmx4g -DextractDir=./component -DexportPath=./result.json \
-jar ./orient-console.jar /your/database/backup/path/component-2021-11-17-22-00-00-3.34.0-01.bak
- Sample output in the result file:
cat ./result.json
[
{"@type":"d","@rid":"#-2:1","@version":0,"r":"maven-central","asset_count":30,"asset_size":2968853,"@fieldTypes":"asset_count=l,asset_size=l"}
]
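The result file can also be post-processed programmatically. The sketch below (an illustration, assuming the JSON array format shown above) totals the asset counts and sizes across all rows:

```python
import json

def summarize(path):
    """Total asset_count and asset_size across all rows of an orient-console result file."""
    with open(path) as f:
        rows = json.load(f)
    total_count = sum(r["asset_count"] for r in rows)
    total_size = sum(r["asset_size"] for r in rows)
    return total_count, total_size
```

For the sample output above, `summarize("./result.json")` returns `(30, 2968853)`.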
NOTE: If you see a "java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.Native" error, it can be caused by the "noexec" mount flag on /tmp. If you do not want to change the mount flag on /tmp, you can change the temp directory location by adding "-Djava.io.tmpdir=/new/tmp/location" to the command.
Using Groovy Scripting to Report Blob Sizes
The following options require that scripting is enabled inside Repository 3. Enabling and using scripting carries risk, and newer versions of Repository disable scripting by default. Use scripts at your own risk. Sonatype does not guarantee forward compatibility with future versions of Nexus Repo 3 and recommends testing scripts in a non-production environment first.
Estimating the Effects of Cleanup Policy Criteria
Use Case
You may want to estimate the amount of storage space that could be reclaimed by applying certain Cleanup Policy criteria.
Download the Script
The groovy script source to generate this report can be found here: EstimateCleanupPolicyEffects.groovy
Running the Script
The script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of EstimateCleanupPolicyEffects.groovy into the field.
- (Optional) Customize the options defined at the top of the script to match your proposed cleanup policy criteria.
- Save and Run the task.
Examining the Output
Check the nexus.log for the script output, which reports, per blob store and per repository, the asset counts and byte size of storage that would match the criteria. Example output:
{ "blobstores": { "default": { "count": 2, "size": 51381 } }, "repositories": { "raw-hosted": { "count": 1, "size": 24927 }, "raw-proxy": { "count": 1, "size": 26454 } } }
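Because the log line is compact JSON, it can be fed straight into a script for further analysis. A small sketch (hypothetical, assuming the structure shown above) that ranks repositories by the size that would be cleaned up:

```python
import json

def rank_repositories(report_json):
    """Return (name, details) pairs from the cleanup estimate, largest size first."""
    repos = json.loads(report_json)["repositories"]
    return sorted(repos.items(), key=lambda kv: kv[1]["size"], reverse=True)
```

Feeding the example output above yields raw-proxy (26454 bytes) before raw-hosted (24927 bytes).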
Listing the Size of File Based Blobs in Repositories / Blobstores
Use Case
For file-based blob stores (not cloud based), it is possible to run a script against the actual blobs on disk, rather than the database, to report the blob storage actually used, as well as how much space could be reclaimed by hard-deleting already soft-deleted blobs.
Download the Script
The groovy script source to generate this report can be found here: nx-blob-repo-space-report-20220510.groovy
Running The Script
The script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of nx-blob-repo-space-report-20220510.groovy into the field.
- (Optional, Recommended) Customize the options defined in the script source, in particular those that include or exclude specific repositories.
Important: If your instance has a large number of repositories, use "REPOSITORY_WHITELIST" and "REPOSITORY_BLACKLIST" to reduce the execution time and memory usage of this task.
- Save and Run the task.
Examining the Report
When you execute the task, the output within nexus.log will look similar to the following (the directories scanned will differ):
*SYSTEM Script47 - Blob Storage scan STARTED.
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/default
*SYSTEM Script47 - Scanning /opt/nexus/test2
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/test1
*SYSTEM Script47 - Blob Storage scan ENDED. Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json
You should be able to find the generated JSON report at the location provided in the log - the actual location will vary according to your configuration:
Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json
Within the JSON report, there are details of each blob store and each repository that uses the blob store. For example, the output below shows two blob stores, each having a single repository:
{
"blobstore1": {
"repositories": {
"repositoryA": {
"reclaimableBytes": 0,
"totalBytes": 4173387
}
},
"totalBlobStoreBytes": 4173387,
"totalReclaimableBytes": 0,
"totalRepoNameMissingCount": 0
},
"blobstore2": {
"repositories": {
"repositoryB": {
"reclaimableBytes": 0,
"totalBytes": 1397598
}
},
"totalBlobStoreBytes": 1397598,
"totalReclaimableBytes": 0,
"totalRepoNameMissingCount": 0
}
}
For each repository, totalBytes indicates how much space is being used and reclaimableBytes indicates how much space may be reclaimed by running the Compact Blob Store maintenance task.
For each blob store, all of the repository entries are aggregated. totalRepoNameMissingCount displays how many assets within the blob store are associated with a repository that no longer exists.
The report also includes repositories that are empty.
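The per-blob-store totals in the report make it easy to compute how much of each store is reclaimable. A hypothetical helper, assuming the report structure shown above:

```python
import json

def reclaimable_summary(report_path):
    """For each blob store, report (total bytes, reclaimable bytes, reclaimable percent)."""
    with open(report_path) as f:
        report = json.load(f)
    summary = {}
    for store, data in report.items():
        total = data["totalBlobStoreBytes"]
        reclaimable = data["totalReclaimableBytes"]
        pct = round(100.0 * reclaimable / total, 1) if total else 0.0
        summary[store] = (total, reclaimable, pct)
    return summary
```

For the example report above, both blob stores would show 0.0% reclaimable.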
Problems Running the Script
Caused by: java.lang.NoClassDefFoundError: org/apache/tools/ant/BuildLogger
This message suggests that you are using an older version of the script that is incompatible with your current version of Nexus Repository Manager. Download the latest version from this article and try running the script again.
OutOfMemoryError
On large instances, or instances where available heap memory is constrained, this script can fill the heap and fail with out-of-memory errors. For this reason, use the options that limit which repositories are processed.
Taking a long time to complete
If the script takes many hours or even days to complete, use the limit options at the top of the script to reduce its scope.
Finding the Largest Blobs Within a Blob Store
The groovy script below will help find the largest blobs inside a blob store directory. The output will contain a list of blobs larger than 100M sorted by size. Adjust min_size as required.
Execute script from Task
An alternate version of the script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of the script in this field
import org.sonatype.nexus.repository.storage.StorageFacet
import org.sonatype.nexus.repository.Repository
import org.sonatype.nexus.repository.storage.Asset
import groovy.json.JsonOutput

long min_size = 100000000

repository.repositoryManager.browse().each { Repository repo ->
    StorageFacet storageFacet = repo.facet(StorageFacet)
    log.info("#### Repository: " + repo.getName() + " ####")
    def tx = storageFacet.txSupplier().get()
    def results = [:].withDefault { 0 }
    try {
        tx.begin()
        tx.browseAssets(tx.findBucket(repo)).each { Asset asset ->
            if (asset.size() > min_size) {
                results.put(asset.name(), asset.size())
            }
        }
    } finally {
        tx.close()
    }
    def sorted = results.sort { a, b -> b.value <=> a.value }
    log.info(JsonOutput.prettyPrint(JsonOutput.toJson(sorted)))
}
- (Optional) Customize the script source options, in particular min_size.
- Run the task.
- Examine the nexus.log for the output.
Execute script from command line
Execute the script from the command line, making sure that:
- Groovy is installed on the host where it is run
- dir_name points to the correct path for your Repository 3 data directory
- the user executing the script has read permissions on the file system
- (optionally) the output is redirected to a file for further processing
long min_size = 100000000
String dir_name = '/opt/Nexus/sonatype-work/nexus3'

def ant = new AntBuilder()
def scanner = ant.fileScanner {
    fileset(dir: dir_name) {
        include(name: '**/blobs/**/*.properties')
        exclude(name: '**/metadata.properties')
        exclude(name: '**/*metrics.properties')
        exclude(name: '**/tmp')
    }
}

def results = [:].withDefault { 0 }
scanner.each { File file ->
    def properties = new Properties()
    file.withInputStream { is -> properties.load(is) }
    long prop_size = properties.size as long
    if (prop_size > min_size) {
        results.put(properties['@BlobStore.blob-name'], prop_size)
    }
}

def sorted = results.sort { a, b -> b.value <=> a.value }
sorted.each { k, v -> println "${k}:${v}" }
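If Groovy is not available on the host, the same scan can be approximated in Python. This is a sketch, not a supported tool: it assumes the standard blob .properties layout with "size" and "@BlobStore.blob-name" keys, and mirrors the include/exclude rules of the Groovy version:

```python
from pathlib import Path

def scan_blobs(dir_name, min_size=100_000_000):
    """Find blobs larger than min_size by reading blob .properties files under dir_name."""
    results = {}
    for path in Path(dir_name).rglob("*.properties"):
        rel = path.relative_to(dir_name)
        # Mirror the Groovy fileset rules: only files under a blobs/ directory,
        # skipping metadata, metrics, and tmp entries.
        if "blobs" not in rel.parts or "tmp" in rel.parts:
            continue
        if path.name == "metadata.properties" or path.name.endswith("metrics.properties"):
            continue
        props = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    props[key] = value
        blob_size = int(props.get("size", 0))
        if blob_size > min_size:
            results[props.get("@BlobStore.blob-name", str(path))] = blob_size
    # Largest first, matching the Groovy output
    for name, blob_size in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}:{blob_size}")
    return results
```

Run it as, for example, `scan_blobs('/opt/Nexus/sonatype-work/nexus3')`, adjusting min_size as required.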
Obtain Repo Sizes for Postgres Deployments
The following Python script can be used to get the size of repositories from a PostgreSQL database. It requires the psycopg2 and hurry.filesize packages, and the DB connection parameters are set via environment variables:
#!/usr/bin/env python3
import os

import psycopg2
from hurry.filesize import size
DB_HOST = os.getenv("NS_DB_HOST")
DB_USER = os.getenv("NS_DB_USER")
DB_PASS = os.getenv("NS_DB_PASS")
DB_DB = os.getenv("NS_DB_DB")
DB_PORT = int(os.getenv("NS_DB_PORT")) if os.getenv("NS_DB_PORT") else 5432
if DB_HOST is None or DB_USER is None or DB_PASS is None or DB_DB is None:
    print("Please set the following environment variables:\nNS_DB_HOST\nNS_DB_USER\nNS_DB_PASS\nNS_DB_DB")
    print("Exiting...")
    exit(1)
conn = psycopg2.connect(
    host=DB_HOST,
    database=DB_DB,
    user=DB_USER,
    password=DB_PASS,
    port=DB_PORT
)
cur = conn.cursor()
cur.execute("SELECT tablename FROM pg_catalog.pg_tables WHERE tablename LIKE '%_content_repository';")
content_repository_tables_names = [x[0] for x in cur.fetchall()]
repos_size = {}
for content_repo in content_repository_tables_names:
    repo_type = content_repo.replace("_content_repository", "")
    cur.execute(f"""select r.name, sum(blob_size) from {repo_type}_asset_blob t_ab
        join {repo_type}_asset t_a on t_ab.asset_blob_id = t_a.asset_blob_id
        join {repo_type}_content_repository t_cr on t_cr.repository_id = t_a.repository_id
        join repository r on t_cr.config_repository_id = r.id
        group by r.name;""")
    repos_size.update(dict(cur.fetchall()))
total = 0
for repo_name in repos_size.keys():
    total += repos_size[repo_name]
    human_readable_size = size(repos_size[repo_name])
    print(f"{repo_name}: {human_readable_size}")
print(f"Sum: {size(total)}")
Sample output:
Repo1: 2M
Repo2: 268M
Repo3: 192M
Sum: 462M