When working with repositories and blob stores, you may want insight into how storage space is being used.
This article provides a few different ways to find out where repository and blob store space is being consumed.
Using Queries on the OrientDB Component Backup to Report Asset Sizes
All of your asset metadata is stored in the OrientDB component database. Running queries against the live database while the Repository process is using it is strongly discouraged.
However, you can safely run queries against a recent component database backup that is not in use by other processes.
Report of Total Asset Record Sizes Per Repository
- Create a database backup using the Admin - Export databases for backup task
- Download a stand-alone OrientDB executable console jar to the host where the backups reside:
curl -O -kL https://sonatype.zendesk.com/hc/article_attachments/8137481277715/orient-console.jar
- Execute a query against the component db backup that will generate the report. Adjust the values as needed:
- replace 'maven-central' with the id of your repository
- replace the blob_updated value with your own date, or remove that criterion
- replace the last_downloaded value with your own date, or remove that criterion
- replace the extractDir value with a path where there is enough disk space to unpack the .bak zip file
- replace the exportPath value with a file location where you want the output generated
echo "SELECT bucket.repository_name as r, count(*) as asset_count, \
sum(size) as asset_size FROM asset WHERE bucket.repository_name = 'maven-central' \
AND blob_updated <= '2021-09-29' AND last_downloaded <= '2021-11-28' \
GROUP BY bucket" | java -Xmx4g -DextractDir=./component -DexportPath=./result.json \
-jar ./orient-console.jar /your/database/backup/path/component-2021-11-17-22-00-00-3.34.0-01.bak
- Sample output in the result file:
cat ./result.json
[
{"@type":"d","@rid":"#-2:1","@version":0,"r":"maven-central","asset_count":30,"asset_size":2968853,"@fieldTypes":"asset_count=l,asset_size=l"}
]
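The result file can also be post-processed programmatically. The sketch below (an illustration, assuming the JSON array format shown above) totals the asset counts and sizes across all rows:

```python
import json

def summarize(path):
    """Total asset_count and asset_size across all rows of an orient-console result file."""
    with open(path) as f:
        rows = json.load(f)
    total_count = sum(r["asset_count"] for r in rows)
    total_size = sum(r["asset_size"] for r in rows)
    return total_count, total_size
```

For the sample output above, `summarize("./result.json")` returns `(30, 2968853)`.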
NOTE: If you see a "java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.Native" error, it can be caused by the "noexec" mount flag on /tmp. If you do not want to change the mount flag on /tmp, you can change the temp directory location by adding "-Djava.io.tmpdir=/new/tmp/location" to the command.
Using Groovy Scripting to Report Blob Sizes
The following options require that scripting is enabled inside Repository 3. Enabling and using scripting carries risk, and newer versions of Repository disable scripting by default. Use scripts at your own risk. Sonatype does not guarantee forward compatibility with future versions of Nexus Repo 3 and recommends testing scripts in a non-production environment first.
Estimating the Effects of Cleanup Policy Criteria
Use Case
You may want to estimate the amount of storage space that could be reclaimed by applying certain Cleanup Policy criteria.
Download the Script
The groovy script source to generate this report can be found here: EstimateCleanupPolicyEffects.groovy
Running the Script
The script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of EstimateCleanupPolicyEffects.groovy into the field.
- (Optional) Customize the options defined at the top of the script to match your proposed cleanup policy criteria.
- Save and Run the task.
Examining the Output
Check the nexus.log for the script output, which reports, per blob store and per repository, the asset counts and byte size of storage that would match the criteria. Example output:
{ "blobstores": { "default": { "count": 2, "size": 51381 } }, "repositories": { "raw-hosted": { "count": 1, "size": 24927 }, "raw-proxy": { "count": 1, "size": 26454 } } }
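Because the log line is compact JSON, it can be fed straight into a script for further analysis. A small sketch (hypothetical, assuming the structure shown above) that ranks repositories by the size that would be cleaned up:

```python
import json

def rank_repositories(report_json):
    """Return (name, details) pairs from the cleanup estimate, largest size first."""
    repos = json.loads(report_json)["repositories"]
    return sorted(repos.items(), key=lambda kv: kv[1]["size"], reverse=True)
```

Feeding the example output above yields raw-proxy (26454 bytes) before raw-hosted (24927 bytes).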
Listing the Size of File Based Blobs in Repositories / Blobstores
Use Case
For file-based blob stores (not cloud based), it is possible to run a script against the actual blobs on disk, rather than the database, to report the blob storage actually used, as well as how much space could be reclaimed by hard-deleting already soft-deleted blobs.
Download the Script
The groovy script source to generate this report can be found here: nx-blob-repo-space-report-20220510.groovy
Running The Script
The script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of nx-blob-repo-space-report-20220510.groovy into the field.
- (Optional, Recommended) Customize the options defined in the script source, in particular those that include or exclude specific repositories.
Important: If your instance has a large number of repositories, use "REPOSITORY_WHITELIST" and "REPOSITORY_BLACKLIST" to reduce the execution time and memory usage of this task.
- Save and Run the task.
Examining the Report
When you execute the task, the output within nexus.log will look similar to the following (the directories scanned will differ):
*SYSTEM Script47 - Blob Storage scan STARTED.
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/default
*SYSTEM Script47 - Scanning /opt/nexus/test2
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/test1
*SYSTEM Script47 - Blob Storage scan ENDED. Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json
You should be able to find the generated JSON report at the location provided in the log - the actual location will vary according to your configuration:
Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json
Within the JSON report, there are details of each blob store and each repository that uses the blob store. For example, the output below shows two blob stores, each having a single repository:
{
"blobstore1": {
"repositories": {
"repositoryA": {
"reclaimableBytes": 0,
"totalBytes": 4173387
}
},
"totalBlobStoreBytes": 4173387,
"totalReclaimableBytes": 0,
"totalRepoNameMissingCount": 0
},
"blobstore2": {
"repositories": {
"repositoryB": {
"reclaimableBytes": 0,
"totalBytes": 1397598
}
},
"totalBlobStoreBytes": 1397598,
"totalReclaimableBytes": 0,
"totalRepoNameMissingCount": 0
}
}
For each repository, totalBytes indicates how much space is being used and reclaimableBytes indicates how much space may be reclaimed by running the Compact Blob Store maintenance task.
For each blob store, all of the repository entries are aggregated. totalRepoNameMissingCount displays how many assets within the blob store are associated with a repository that no longer exists.
The report also includes repositories that are empty.
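The per-blob-store totals in the report make it easy to compute how much of each store is reclaimable. A hypothetical helper, assuming the report structure shown above:

```python
import json

def reclaimable_summary(report_path):
    """For each blob store, report (total bytes, reclaimable bytes, reclaimable percent)."""
    with open(report_path) as f:
        report = json.load(f)
    summary = {}
    for store, data in report.items():
        total = data["totalBlobStoreBytes"]
        reclaimable = data["totalReclaimableBytes"]
        pct = round(100.0 * reclaimable / total, 1) if total else 0.0
        summary[store] = (total, reclaimable, pct)
    return summary
```

For the example report above, both blob stores would show 0.0% reclaimable.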
Problems Running the Script
Caused by: java.lang.NoClassDefFoundError: org/apache/tools/ant/BuildLogger
This message suggests that you are using an older version of the script that is incompatible with your current version of Nexus Repository Manager. Download the latest version from this article and try running the script again.
OutOfMemoryError
On large instances, or instances where available heap memory is constrained, this script can fill the heap and fail with out-of-memory errors. For this reason, use the options that limit which repositories are processed.
Taking a long time to complete
If the script takes many hours or even days to complete, use the limit options at the top of the script to reduce its scope.
Finding the Largest Blobs Within a Blob Store
The groovy script below will help find the largest blobs inside a blob store directory. The output will contain a list of blobs larger than 100M sorted by size. Adjust min_size as required.
Execute script from Task
An alternate version of the script can be executed as a task in Nexus Repository Manager.
- Enable Scripting
- Under Administration > System > Tasks, create a new Admin - Execute Script task.
Language: groovy
Task frequency: Manual
Source: paste the contents of the script in this field
import org.sonatype.nexus.repository.storage.StorageFacet
import org.sonatype.nexus.repository.Repository
import org.sonatype.nexus.repository.storage.Asset
import groovy.json.JsonOutput

long min_size = 100000000

repository.repositoryManager.browse().each { Repository repo ->
    StorageFacet storageFacet = repo.facet(StorageFacet)
    log.info("#### Repository: " + repo.getName() + " ####")
    def tx = storageFacet.txSupplier().get()
    def results = [:].withDefault { 0 }
    try {
        tx.begin()
        tx.browseAssets(tx.findBucket(repo)).each { Asset asset ->
            if (asset.size() > min_size) {
                results.put(asset.name(), asset.size())
            }
        }
    } finally {
        tx.close()
    }
    def sorted = results.sort { a, b -> b.value <=> a.value }
    log.info(JsonOutput.prettyPrint(JsonOutput.toJson(sorted)))
}
- (Optional) Customize the script source options, in particular min_size.
- Run the task.
- Examine the nexus.log for the output.
Execute script from command line
Execute the script from the command line, making sure that:
- Groovy is installed on the host where it is run
- dir_name points to the correct path for your Repository 3 data directory
- the user executing the script has read permissions on the file system
- (optionally) the output is redirected to a file for further processing
long min_size = 100000000
String dir_name = '/opt/Nexus/sonatype-work/nexus3'

def ant = new AntBuilder()
def scanner = ant.fileScanner {
    fileset(dir: dir_name) {
        include(name: '**/blobs/**/*.properties')
        exclude(name: '**/metadata.properties')
        exclude(name: '**/*metrics.properties')
        exclude(name: '**/tmp')
    }
}

def results = [:].withDefault { 0 }
scanner.each { File file ->
    def properties = new Properties()
    file.withInputStream { is -> properties.load(is) }
    long prop_size = properties.size as long
    if (prop_size > min_size) {
        results.put(properties['@BlobStore.blob-name'], prop_size)
    }
}

def sorted = results.sort { a, b -> b.value <=> a.value }
sorted.each { k, v -> println "${k}:${v}" }
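If Groovy is not available on the host, the same scan can be approximated in Python. This is a sketch, not a supported tool: it assumes the standard blob .properties layout with "size" and "@BlobStore.blob-name" keys, and mirrors the include/exclude rules of the Groovy version:

```python
from pathlib import Path

def scan_blobs(dir_name, min_size=100_000_000):
    """Find blobs larger than min_size by reading blob .properties files under dir_name."""
    results = {}
    for path in Path(dir_name).rglob("*.properties"):
        rel = path.relative_to(dir_name)
        # Mirror the Groovy fileset rules: only files under a blobs/ directory,
        # skipping metadata, metrics, and tmp entries.
        if "blobs" not in rel.parts or "tmp" in rel.parts:
            continue
        if path.name == "metadata.properties" or path.name.endswith("metrics.properties"):
            continue
        props = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    props[key] = value
        blob_size = int(props.get("size", 0))
        if blob_size > min_size:
            results[props.get("@BlobStore.blob-name", str(path))] = blob_size
    # Largest first, matching the Groovy output
    for name, blob_size in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}:{blob_size}")
    return results
```

Run it as, for example, `scan_blobs('/opt/Nexus/sonatype-work/nexus3')`, adjusting min_size as required.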
Obtain Repo Sizes for Postgres Deployments
The following Python script can be used to get the size of repositories from a PostgreSQL database. It requires the psycopg2 and hurry.filesize packages, and the DB connection parameters are set via environment variables:
#!/usr/bin/env python3
import os

import psycopg2
from hurry.filesize import size
DB_HOST = os.getenv("NS_DB_HOST")
DB_USER = os.getenv("NS_DB_USER")
DB_PASS = os.getenv("NS_DB_PASS")
DB_DB = os.getenv("NS_DB_DB")
DB_PORT = int(os.getenv("NS_DB_PORT")) if os.getenv("NS_DB_PORT") else 5432
if DB_HOST is None or DB_USER is None or DB_PASS is None or DB_DB is None:
    print("Please set the following environment variables:\nNS_DB_HOST\nNS_DB_USER\nNS_DB_PASS\nNS_DB_DB")
    print("Exiting...")
    exit(1)
conn = psycopg2.connect(
    host=DB_HOST,
    database=DB_DB,
    user=DB_USER,
    password=DB_PASS,
    port=DB_PORT
)
cur = conn.cursor()
cur.execute("SELECT tablename FROM pg_catalog.pg_tables WHERE tablename LIKE '%_content_repository';")
content_repository_tables_names = [x[0] for x in cur.fetchall()]
repos_size = {}
for content_repo in content_repository_tables_names:
    repo_type = content_repo.replace("_content_repository", "")
    cur.execute(f"""select r.name, sum(blob_size) from {repo_type}_asset_blob t_ab
        join {repo_type}_asset t_a on t_ab.asset_blob_id = t_a.asset_blob_id
        join {repo_type}_content_repository t_cr on t_cr.repository_id = t_a.repository_id
        join repository r on t_cr.config_repository_id = r.id
        group by r.name;""")
    repos_size.update(dict(cur.fetchall()))
total = 0
for repo_name in repos_size.keys():
    total += repos_size[repo_name]
    human_readable_size = size(repos_size[repo_name])
    print(f"{repo_name}: {human_readable_size}")
print(f"Sum: {size(total)}")
Sample output:
Repo1: 2M
Repo2: 268M
Repo3: 192M
Sum: 462M