Investigating Blob Store and Repository Size and Space Usage


When working with repositories and blob stores, you may want to have some insight into how the storage space is being used.

This article provides a few different ways to find out where repository and blob store space is being consumed.

Using Queries on the OrientDB Component Backup to Report Asset Sizes

All of your asset metadata is stored in the OrientDB component database. Connecting to and running queries against the database while it is in use by the Repository process is strongly discouraged.

However, you can safely run queries against a recent component database backup that is not in use by other processes.

Report of Total Asset Record Sizes Per Repository 

  1. Create a database backup using the Admin - Export databases for backup task
  2. Download a stand-alone OrientDB executable console jar to the host where the backups reside:
    Download orient-console.jar
    curl -O -L https://sonatype.zendesk.com/hc/article_attachments/4412115223955/orient-console.jar;
  3. Execute a query against the component db backup that will generate the report. Adjust the values as needed:
    • replace 'maven-central' with the id of your repository
    • replace the blob_updated value with your own date, or remove that criterion
    • replace the last_downloaded value with your own date, or remove that criterion
    • replace the extractDir value with a path that has enough disk space to unpack the bak zip file
    • replace the exportPath value with the file location where you want the output generated
    echo "SELECT bucket.repository_name as r, count(*) as asset_count, \
    sum(size) as asset_size FROM asset WHERE bucket.repository_name = 'maven-central' \
    AND blob_updated <= '2021-09-29' AND last_downloaded <= '2021-11-28' \
    GROUP BY bucket" | java -Xmx4g -DextractDir=./component -DexportPath=./result.json \
    -jar ./orient-console.jar /your/database/backup/path/component-2021-11-17-22-00-00-3.34.0-01.bak
  4. Sample output in the result file
    cat ./result.json
    [
      {"@type":"d","@rid":"#-2:1","@version":0,"r":"maven-central","asset_count":30,"asset_size":2968853,"@fieldTypes":"asset_count=l,asset_size=l"}
    ]
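
As a follow-up to step 4, the exported result file can be summarized with a short script. The sketch below uses Python, which is an assumption about what is available on the host. The field names r, asset_count and asset_size come from the query aliases above; the function names human_size and summarize are hypothetical.

```python
import json


def human_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def summarize(records):
    """Return (repository, asset_count, asset_size) tuples, largest first."""
    rows = [(rec["r"], rec["asset_count"], rec["asset_size"]) for rec in records]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample record copied from the output shown in step 4.
sample = (
    '[{"@type":"d","@rid":"#-2:1","@version":0,"r":"maven-central",'
    '"asset_count":30,"asset_size":2968853,'
    '"@fieldTypes":"asset_count=l,asset_size=l"}]'
)

for repo, count, size in summarize(json.loads(sample)):
    print(f"{repo}: {count} assets, {human_size(size)}")
```

To run this against a real export, load the file written to exportPath (./result.json in the example above) with json.load instead of parsing the inline sample.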

NOTE: If you see a "java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.Native" error, it can be caused by the "noexec" flag on /tmp. If you do not want to change the mount flag on /tmp, you can change the temp directory location by adding "-Djava.io.tmpdir=/new/tmp/location" to the java command.
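
To check whether the "noexec" flag applies before running the query, the mount options can be inspected. The following is a minimal sketch that assumes a Linux host, where /proc/mounts lists one mount per line as device, mount point, file system type and options. The function names are hypothetical, and the sample text is illustrative rather than taken from a real host.

```python
def mount_options_for(path, mounts_text):
    """Return the options of the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    best_point, best_options = "", []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        contains_path = path == mount_point or path.startswith(prefix)
        # Prefer the longest matching mount point; on ties the later line wins.
        if contains_path and len(mount_point) >= len(best_point):
            best_point, best_options = mount_point, options
    return best_options


def is_noexec(path, mounts_text):
    """True if the mount containing `path` carries the noexec option."""
    return "noexec" in mount_options_for(path, mounts_text)


# Illustrative sample; on a real host read the text from /proc/mounts.
sample_mounts = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec 0 0
"""

print("/tmp noexec:", is_noexec("/tmp", sample_mounts))
```

On a real host, pass the contents of /proc/mounts as mounts_text. If the check reports noexec for /tmp, point java.io.tmpdir at a directory on a mount without that flag, as described in the note.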

 

Using Groovy Scripting to Report Blob Sizes

The following options require that scripting is enabled in Repository 3. Enabling and using scripting carries risk, and newer versions of Repository have scripting disabled by default. Use scripts at your own risk. Sonatype recommends testing scripts in a non-production environment first.

Estimating the Effects of Cleanup Policy Criteria

Use Case

One may want to estimate the amount of storage space that could potentially be cleaned up by using certain Cleanup Policy criteria.

Download the Script

The groovy script source to generate this report can be found here: EstimateCleanupPolicyEffects.groovy

Running the Script

The script can be executed as a task in Nexus Repository Manager.

  1. Enable Scripting
  2. Under Administration > System > Tasks, create a new Admin - Execute Script task.
    Language: groovy
    Task frequency: Manual
    Source: paste the contents of EstimateCleanupPolicyEffects.groovy into the field.
  3. (Optional) Customize the options defined at the top of the script source to match your proposed cleanup policy criteria.
  4. Save and Run the task.

Examining the Output

Check nexus.log for the script output, which reports, per blob store and per repository, the asset count and the size in bytes of the storage that matches the criteria. Example output:

{
  "blobstores": {
    "default": {
      "count": 2,
      "size": 51381
    }
  },
  "repositories": {
    "raw-hosted": {
      "count": 1,
      "size": 24927
    },
    "raw-proxy": {
      "count": 1,
      "size": 26454
    }
  }
}
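
Once the JSON has been copied out of nexus.log, it can be ranked to see which repositories would contribute most to a cleanup. The sketch below is an illustration in Python, not part of the Sonatype script; the function name rank_repositories is hypothetical, and the sample is the example output shown above.

```python
import json


def rank_repositories(report):
    """Return (name, count, size, percent_of_total) tuples, largest first."""
    repos = report["repositories"]
    total = sum(entry["size"] for entry in repos.values())
    rows = []
    for name, entry in repos.items():
        # Guard against an empty report where nothing matches the criteria.
        percent = 100.0 * entry["size"] / total if total else 0.0
        rows.append((name, entry["count"], entry["size"], percent))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Example output from the article, pasted as a string.
sample = """
{
  "blobstores": {"default": {"count": 2, "size": 51381}},
  "repositories": {
    "raw-hosted": {"count": 1, "size": 24927},
    "raw-proxy": {"count": 1, "size": 26454}
  }
}
"""

for name, count, size, percent in rank_repositories(json.loads(sample)):
    print(f"{name}: {count} assets, {size} bytes ({percent:.1f}%)")
```

The percentages are relative to the sum of the repository sizes in the report, so they describe each repository's share of the estimated cleanup, not its share of the whole blob store.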

 

Listing the Size of File Based Blobs in Repositories / Blobstores

Use Case

For file-based blob stores (not cloud-based), you can run a script against the actual blobs on disk, rather than the database, to report the blob storage actually in use and how much space could be reclaimed by hard-deleting blobs that are already soft-deleted.

Download the Script

The groovy script source to generate this report can be found here: nx-blob-repo-space-report-20220510.groovy

Running the Script

The script can be executed as a task in Nexus Repository Manager.

  1. Enable Scripting
  2. Under Administration > System > Tasks, create a new Admin - Execute Script task.
    Language: groovy
    Task frequency: Manual
    Source: paste the contents of nx-blob-repo-space-report-20220510.groovy into the field.
  3. (Optional, Recommended) Customize the options defined in the script source, in particular the options to include or exclude specific repositories.
    Important: If your instance has a large number of repositories, use "REPOSITORY_WHITELIST" and "REPOSITORY_BLACKLIST" to reduce the execution time and memory usage of this task.
  4. Save and Run the task.

Examining the Report

When you execute the task, the output within nexus.log will look similar to the following (the directories scanned will differ):

*SYSTEM Script47 - Blob Storage scan STARTED.
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/default
*SYSTEM Script47 - Scanning /opt/nexus/test2
*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/test1
*SYSTEM Script47 - Blob Storage scan ENDED. Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json

You should be able to find the generated JSON report at the location provided in the log - the actual location will vary according to your configuration:

Report at /home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json
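
If the log is large, the report location can be pulled out programmatically. The following Python sketch assumes the log message keeps the wording shown above ("Blob Storage scan ENDED. Report at ...") and that the path contains no spaces; the function name find_report_paths is hypothetical.

```python
import re

# Matches the end-of-scan message and captures the path that follows it.
REPORT_PATTERN = re.compile(r"Blob Storage scan ENDED\. Report at (\S+)")


def find_report_paths(log_lines):
    """Return every report path mentioned in the given log lines, in order."""
    paths = []
    for line in log_lines:
        match = REPORT_PATTERN.search(line)
        if match:
            paths.append(match.group(1))
    return paths


# Lines copied from the sample log output above.
sample_log = [
    "*SYSTEM Script47 - Blob Storage scan STARTED.",
    "*SYSTEM Script47 - Scanning /home/nexus/sonatype-work/nexus3/blobs/default",
    "*SYSTEM Script47 - Blob Storage scan ENDED. Report at "
    "/home/nexus/sonatype-work/nexus3/tmp/repoSizes-20181213-104154.json",
]

print(find_report_paths(sample_log))
```

When the task has been run several times, the last entry in the returned list corresponds to the most recent report mentioned in the lines provided.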

Within the JSON report, there are details of each blob store and each repository that uses the blob store. For example, the output below shows two blob stores, each having a single repository:

{
  "blobstore1": {
    "repositories": {
      "repositoryA": {
        "reclaimableBytes": 0,
        "totalBytes": 4173387
      }
    },
    "totalBlobStoreBytes": 4173387,
    "totalReclaimableBytes": 0,
    "totalRepoNameMissingCount": 0
  },
  "blobstore2": {
    "repositories": {
      "repositoryB": {
        "reclaimableBytes": 0,
        "totalBytes": 1397598
      }
    },
    "totalBlobStoreBytes": 1397598,
    "totalReclaimableBytes": 0,
    "totalRepoNameMissingCount": 0
  }
}

For each repository, totalBytes indicates how much space is being used and reclaimableBytes indicates how much space may be reclaimed by running the Compact Blob Store maintenance task.

For each blob store, all of the repository entries are aggregated. totalRepoNameMissingCount will display how many assets within the blob store are associated with a repository that no longer exists.

The report will also include repositories that are empty.
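
To compare repositories across blob stores, the report can be flattened into one row per repository. This is a minimal sketch in Python that relies only on the field names shown in the sample report above; the function name flatten_report is hypothetical.

```python
import json


def flatten_report(report):
    """Return (blob_store, repository, totalBytes, reclaimableBytes) rows.

    Rows are sorted by totalBytes, largest first.
    """
    rows = []
    for store_name, store in report.items():
        for repo_name, repo in store["repositories"].items():
            rows.append(
                (store_name, repo_name, repo["totalBytes"], repo["reclaimableBytes"])
            )
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample report copied from the example above.
sample = """
{
  "blobstore1": {
    "repositories": {"repositoryA": {"reclaimableBytes": 0, "totalBytes": 4173387}},
    "totalBlobStoreBytes": 4173387,
    "totalReclaimableBytes": 0,
    "totalRepoNameMissingCount": 0
  },
  "blobstore2": {
    "repositories": {"repositoryB": {"reclaimableBytes": 0, "totalBytes": 1397598}},
    "totalBlobStoreBytes": 1397598,
    "totalReclaimableBytes": 0,
    "totalRepoNameMissingCount": 0
  }
}
"""

for store, repo, total, reclaimable in flatten_report(json.loads(sample)):
    print(f"{store}/{repo}: {total} bytes used, {reclaimable} reclaimable")
```

Sorting by the fourth column instead of the third would highlight the repositories where running the Compact Blob Store task is likely to free the most space.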

Problems Running the Script

Caused by: java.lang.NoClassDefFoundError: org/apache/tools/ant/BuildLogger

The message suggests that you are using an older version of the script that is incompatible with your current version of Nexus Repository Manager. Download the latest version from this article and run the script again.

OutOfMemoryError

On large instances, or instances where available heap memory is constrained, this script can fill the heap and fail with out-of-memory errors. For this reason, use the options that limit which repositories are processed.

Taking a long time to complete

If the script takes many hours or even days to complete, use the limit options at the top of the script to narrow its scope.

Finding the Largest Blobs Within a Blob Store

The Groovy script below helps find the largest blobs inside a blob store directory. The output lists blobs larger than 100 MB (min_size is 100000000 bytes), sorted by size. Adjust min_size as required.

Execute script from Task

An alternate version of the command-line script can be executed as a task in Nexus Repository Manager.

  1. Enable Scripting
  2. Under Administration > System > Tasks, create a new Admin - Execute Script task.
    Language: groovy
    Task frequency: Manual
    Source: paste the contents of the script in this field
    import org.sonatype.nexus.repository.storage.StorageFacet
    import org.sonatype.nexus.repository.Repository
    import org.sonatype.nexus.repository.storage.Asset
    import groovy.json.JsonOutput
    
    long min_size = 100000000
    
    repository.repositoryManager.browse().each { Repository repo ->
        StorageFacet storageFacet = repo.facet(StorageFacet)
        log.info("#### Repository: " + repo.getName() + " ####")
        def tx = storageFacet.txSupplier().get()
        def results = [:].withDefault { 0 }
        try {
            tx.begin()
            tx.browseAssets(tx.findBucket(repo)).each { Asset asset ->
                if (asset.size() > min_size) {
                    results.put(asset.name(), asset.size())
                }
            }
        } finally {
            tx.close()
        }
        def sorted = results.sort { a, b -> b.value <=> a.value }
        log.info(JsonOutput.prettyPrint(JsonOutput.toJson(sorted)))
    }
  3. (Optional) Customize the script, in particular the min_size value.
  4. Run the task.
  5. Examine the nexus.log for the output.

Execute script from command line

Execute the script from the command line, making sure that:

  • Groovy is installed on the host where the script is run
  • dir_name points to the correct path for your Repository 3 data directory
  • the user executing the script has permission to read the data directory

Optionally, redirect the output to a file for further processing.
long min_size = 100000000
String dir_name = '/opt/Nexus/sonatype-work/nexus3'

def ant = new AntBuilder()
def scanner = ant.fileScanner {
  fileset(dir: dir_name) {
    include(name: '**/blobs/**/*.properties')
    exclude(name: '**/metadata.properties')
    exclude(name: '**/*metrics.properties')
    exclude(name: '**/tmp')
  }
}
def results = [:].withDefault { 0 }
scanner.each { File file ->
  def properties = new Properties()
  file.withInputStream { is ->
    properties.load(is)
  }
  long prop_size = properties.size as long;
  if (prop_size > min_size) {
    results.put(properties['@BlobStore.blob-name'], prop_size)
  }
}
def sorted = results.sort { a, b -> b.value <=> a.value }

sorted.each{ k, v -> println "${k}:${v}" }
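
If the command-line output has been redirected to a file, it can be post-processed further. The sketch below is in Python and assumes each line has the "name:size" form printed by the script above. The blob names in the sample are made up for illustration, and the function name parse_blob_sizes is hypothetical.

```python
def parse_blob_sizes(lines):
    """Parse 'blob-name:size' lines into (name, size) tuples, largest first."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last colon only, since blob names may contain colons.
        name, _, size = line.rpartition(":")
        entries.append((name, int(size)))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Hypothetical output lines; real names come from @BlobStore.blob-name.
sample_output = [
    "com/example/app/1.0/app-1.0.jar:250000000",
    "v2/-/blobs/sha256:0123abcd:1200000000",
    "",
]

for name, size in parse_blob_sizes(sample_output)[:10]:
    print(f"{size / 1_000_000:.0f} MB  {name}")
```

Slicing the result, as in the loop above, limits the listing to the ten largest blobs.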