Data Archival



Overview

  • As a database grows very large, basic functions such as indexing and document distribution may slow down. Archiving documents when they become obsolete, when they are rarely used, or for other reasons exempts them from processing by RavenDB features and can significantly improve performance.

  • A document can be scheduled for archival by adding an @archive-at property to its metadata, with the requested archival time (in UTC) as its value.
    When the archival feature is enabled, RavenDB runs an archiving task that periodically scans the database for documents scheduled for archival.

    On a cluster, the archiving task runs on one node only, which is always the first node in the cluster topology. Archived documents are then propagated to the other nodes by regular replication.

  • When it's time to archive a document, the archiving task archives it and then replaces its metadata @archive-at property with an @archived: true property.

    The @archived: true metadata property is only an external indication that RavenDB has archived a document. Users cannot archive a document manually by adding this property to its metadata. To archive a document, schedule its archival.

  • Features like indexing and data subscriptions recognize archived documents by internal RavenDB flags, and can react to them based either on a default high-level policy (set by server or database configuration options) or on a local index or task definition.

  • Other features, and RavenDB clients, can recognize archived documents by their metadata @archived: true property and apply whatever specific logic suits them.
    A user-defined ETL task, for example, can use this property to avoid sending archived documents to its target.


Licensing

Archival is available on an Enterprise license.

Learn more about licensing here.

Scheduling Document Archival

A document can be scheduled for archival by adding an @archive-at property to its metadata, with the designated archival time (in UTC) as its value.

Provide the scheduled time in DateTime format, in UTC.
E.g., DateTime.UtcNow

companies/90-A:

{
    "Name": "Wilman Kala",
    "Phone": "90-224 8858",
    "@metadata": {
        "@archive-at": "2024-01-01T12:00:00.000Z",
        "@collection": "Companies"
    }
}
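
As a minimal sketch of producing such a value from client code, the hypothetical helper below (ArchivalScheduler is not part of the RavenDB client) normalizes a DateTime to UTC and writes it to a metadata dictionary as a round-trip ISO 8601 string. It assumes that the metadata object exposes the standard IDictionary<string, object> interface, and that the seven-fraction-digit "o" format is accepted as an @archive-at value.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of the RavenDB client API.
public static class ArchivalScheduler
{
    public const string ArchiveAtKey = "@archive-at";

    // Renders a point in time as a UTC ISO 8601 string ("o" round-trip format).
    // Assumption: values whose Kind is Unspecified are already expressed in UTC.
    public static string ToUtcString(DateTime when)
    {
        DateTime utc = when.Kind == DateTimeKind.Local
            ? when.ToUniversalTime()
            : DateTime.SpecifyKind(when, DateTimeKind.Utc);
        return utc.ToString("o", CultureInfo.InvariantCulture);
    }

    // Adds (or overwrites) the @archive-at property in a document's metadata.
    public static void Schedule(IDictionary<string, object> metadata, DateTime archiveAt)
    {
        if (metadata == null) throw new ArgumentNullException(nameof(metadata));
        metadata[ArchiveAtKey] = ToUtcString(archiveAt);
    }
}
```

In a client session, the dictionary would typically be the one returned by session.Advanced.GetMetadataFor(entity), and the change would be persisted by session.SaveChanges().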

RavenDB scans for documents scheduled for archival at a frequency set by DataArchivalConfiguration.ArchiveFrequencyInSec.

Around the specified time (allowing for the scan frequency), the document will be archived:

  • The document will be compressed.

    Compressing archived documents saves disk space.
    Note, though, that the compression of archived documents, as well as their possible absence from indexes, makes retrieving them slower and more CPU- and memory-intensive than retrieving non-archived documents.

  • RavenDB will set its internal flags so features like indexing and data subscriptions can recognize the document as archived.

  • The document's @archive-at property will be replaced with an @archived: true property so clients can recognize the document as archived and handle it however they choose.

    companies/90-A:

    {
        "Name": "Wilman Kala",
        "Phone": "90-224 8858",
        "@metadata": {
            "@archived": true,
            "@collection": "Companies"
        }
    }
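
A client that wants to treat archived documents differently can test for this property. The sketch below is a hypothetical helper (ArchivalStatus is not part of the RavenDB client); it assumes that the document's metadata is available as an IDictionary<string, object> and that the flag is normally deserialized as a boolean.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the RavenDB client API.
public static class ArchivalStatus
{
    // Returns true only when the metadata carries "@archived": true.
    // A missing property, or any value other than true, means "not archived".
    public static bool IsArchived(IDictionary<string, object> metadata)
    {
        if (metadata == null || !metadata.TryGetValue("@archived", out object value))
            return false;

        return value switch
        {
            bool flag => flag,
            // Defensive: tolerate a textual "true" if the value arrived as a string.
            string text => bool.TryParse(text, out bool parsed) && parsed,
            _ => false
        };
    }
}
```

A document that is only scheduled for archival (one that still carries @archive-at) is reported as not archived, which matches the replacement of @archive-at by @archived described above.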

Archival and Other Features

Archiving and Indexing

Indexing efficiency in particular may drop when a database becomes very big, as a larger number of documents requires more indexing resources, increases the number and size of indexes, and may eventually reduce querying speed.

Routinely archiving documents and excluding archived documents from indexing is therefore an all-round performance enhancer: fewer and more effective indexes are created, and queries are executed over smaller datasets of higher priority.

An index may inherit the way it handles archived documents from the default server or database configuration, or have this behavior defined in the index definition, overriding higher-level configuration.

  • An index definition can override the default database/server configuration to determine how the index would process archived documents.

    store.Maintenance.Send(new PutIndexesOperation(new[] {
        new IndexDefinition
        {
            Maps = {
                //...
            },

            Name = "indexName",

            // Process archived documents
            ArchivedDataProcessingBehavior = ArchivedDataProcessingBehavior.IncludeArchived
        }
    }));

    See the definition of ArchivedDataProcessingBehavior below, under Default (Server/Database) Configuration Options.

  • An index definition can also check whether the document metadata includes @archived: true, and if so freely apply any archived-document logic.

    store.Maintenance.Send(new PutIndexesOperation(new[] {
        new IndexDefinition
        {
            // An index definition needs a name to be deployed
            Name = "indexName",

            Maps = {
                // This will apply only to non-archived documents
                // (whose @archived property is null)
                "from o in docs where o[\"@metadata\"][\"@archived\"] == null select new" +
                "{" +
                "    Name = o.Name" +
                "}"
            }
        }
    }));

  • ArchivedDataProcessingBehavior can also be set when a static index is created by other methods.

    • When the index is created using IndexDefinitionBuilder:

      var indexDefinition = new IndexDefinitionBuilder<Company>
      {
          Map = companies => from company in companies
                             where company.Name == "Company Name"
                             select new { company.Name },
      }.ToIndexDefinition(store.Conventions);
      
      indexDefinition.Name = "indexName";
      
      // Process only archived documents
      indexDefinition.ArchivedDataProcessingBehavior = 
              ArchivedDataProcessingBehavior.ArchivedOnly;
      
      store.Maintenance.Send(new PutIndexesOperation(indexDefinition));

    • When the index is created using AbstractIndexCreationTask:

      public class AllCompanies_AddressText : AbstractIndexCreationTask<Company>
      {
          public AllCompanies_AddressText()
          {
              Map = companies => from company in companies
                                 select new IndexEntry
                                 {
                                     AddressText = company.Address
                                 };
      
              ArchivedDataProcessingBehavior =
                  Raven.Client.Documents.DataArchival.ArchivedDataProcessingBehavior.IncludeArchived;
          }
      }

Archiving and Data Subscriptions

Excluding archived documents from data subscription batches reduces the workload of both the server and the workers, which receive fewer and more relevant documents.

As with indexes, a data subscription task may inherit a default high-level policy toward archived documents or override it locally, in the task definition.

  • The data subscription task below overrides the default server/database settings by setting its ArchivedDataProcessingBehavior property to ArchivedDataProcessingBehavior.ArchivedOnly, so that it processes only archived documents.

    var subsId = await store.Subscriptions.CreateAsync(new SubscriptionCreationOptions
    {
        Query = "from Companies",
        Name = "Created",
        // Process only archived documents
        ArchivedDataProcessingBehavior = ArchivedDataProcessingBehavior.ArchivedOnly
    });

Archiving and Document Extensions

The document extensions of an archived document are not archived, and are not affected in any way by the archival status of their parent document. A time series, for example, will be indexed even if the document that owns it is archived.


Archiving and Smuggler (Import/Export)

Smuggler, used by RavenDB to import and export data, checks documents' archival status and can be set to skip archived docs.

To determine whether archived documents are transferred, set the boolean IncludeArchived property of the DatabaseSmugglerExportOptions instance you pass to Smuggler to true or false.

In the following example, the exported data includes archived documents.

var operation = await store.Smuggler.ExportAsync(
    new DatabaseSmugglerExportOptions { IncludeArchived = true }, path);

By default, archived documents are included when importing/exporting data.


Archiving and Expiration

Archiving can be used alongside other features such as expiration.
A document can, for example, be scheduled for archival in half a year, and for expiration in a year. This would keep newer documents alive and within immediate reach, archive older documents whose retrieval, should it be required, is allowed to be slower, and have documents that are no longer needed deleted.

companies/90-A:

{
    "Name": "Wilman Kala",
    "Phone": "90-224 8858",
    "@metadata": {
        "@archive-at": "2024-03-06T22:45:30.018Z",
        "@expires": "2024-09-06T22:45:30.018Z",
        "@collection": "Companies"
    }
}
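
The two timestamps in a document like this one can be derived from a single reference time. The sketch below is a hypothetical helper (DocumentLifecycle is not part of the RavenDB client) that writes both properties to a metadata dictionary. The six-month and twelve-month offsets come from the scenario described above, the reference time is assumed to be in UTC already, and requiring archival to precede expiration is a choice of this sketch rather than a documented server rule.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of the RavenDB client API.
public static class DocumentLifecycle
{
    // Schedules archival and expiration relative to a reference time.
    // Assumption: referenceUtc is already expressed in UTC.
    public static void Schedule(
        IDictionary<string, object> metadata,
        DateTime referenceUtc,
        int archiveAfterMonths = 6,
        int expireAfterMonths = 12)
    {
        if (metadata == null) throw new ArgumentNullException(nameof(metadata));
        if (archiveAfterMonths >= expireAfterMonths)
            throw new ArgumentException("Archival should be scheduled before expiration.");

        DateTime utc = DateTime.SpecifyKind(referenceUtc, DateTimeKind.Utc);
        metadata["@archive-at"] = Format(utc.AddMonths(archiveAfterMonths));
        metadata["@expires"] = Format(utc.AddMonths(expireAfterMonths));
    }

    // Round-trip ISO 8601 format, e.g. 2024-03-06T22:45:30.0180000Z
    private static string Format(DateTime utc) =>
        utc.ToString("o", CultureInfo.InvariantCulture);
}
```

With a reference time of 2023-09-06T22:45:30.018Z and the default offsets, the helper yields the same archival and expiration instants as the document above, written with seven fractional digits.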

Archiving and ETL

An ETL transform script can check each extracted document's metadata for an @archived: true property, whose existence indicates that the document is archived, and handle the document accordingly.
Archived documents can be skipped, for example, or only relevant parts of them can be sent to the target.

var isArchived = this['@metadata']['@archived'];
if (isArchived === true) {
    return; // do not load archived documents
}

Archiving and Backup

Archived documents are included in backups.


Archiving and Querying

Collection queries retrieve archived documents, since they do not run over indexes that may exclude archived documents from the results.

Auto indexes will not retrieve archived documents if the indexes were created while the default configuration excluded archived documents.

Static indexes will not retrieve archived documents if the indexes were created while the default configuration excluded archived documents, or if the index definition was edited to exclude them locally.


Archiving and Replication

Archived documents are included in regular replication, external replication, and hub/sink replication.

Archiving and Patching

Archive by Patch

To archive documents using patching, schedule their archival using the patch API archived.archiveAt method.

// Pass the method the document object and a string with the designated archival time (in UTC).
archived.archiveAt(doc, utcDateTimeString)

// Archive all the documents in a collection
// Provide the time as a UTC string, e.g. 2024-01-01T12:00:00.0000000Z
// (here: the current UTC time, so documents are archived on the next scan)
string time = DateTime.UtcNow.ToString("o");

var operation = await store.Operations.SendAsync(new PatchByQueryOperation(new IndexQuery()
{
    Query = "from Companies update { archived.archiveAt(this, \"" + time + "\") }"
}));

Unarchive by Patch

Since unarchiving documents is most often a mass operation, it is done using the archived.unarchive method of the patch API. To unarchive a document, pass the method the document object.

// Unarchive all archived documents in a collection
var operation =
    await store.Operations.SendAsync(
        new PatchByQueryOperation(new IndexQuery()
        {
            Query = @"from Companies update 
            {
                archived.unarchive(this)
            }"
        }));

Be aware that a patch query may run over an index, and if the index excludes archived documents the query will not find any archived documents to unarchive.

The following patch query, for example, runs over an index.
If this is an auto index that inherits the default configuration, archived documents will be excluded and the patch will find and unarchive no documents.

var operation =
    await store.Operations.SendAsync(
        new PatchByQueryOperation(new IndexQuery()
        {
            // This query uses an index. If the index excludes archived
            // documents, the patch will find nothing to unarchive.
            Query = @"from Companies where Name == 'shoes' update
                    {
                        archived.unarchive(this)
                    }"
        }));

Two possible workarounds are:

  • Configure the index used by the patch query to include archived documents, as explained above under Archiving and Indexing.
  • Run a simple collection query, which creates and uses no index, and apply your own logic in the patch script to find the documents you want to unarchive. For example:

    var operation =
        await store.Operations.SendAsync(
            new PatchByQueryOperation(new IndexQuery()
            {
                // This collection query will not exclude archived docs,
                // and the inner logic will select docs and unarchive them.  
                Query = @"from Companies as company update 
                          {
                              if (company.Name == 'shoes')
                                  archived.unarchive(this)
                          }"
            }));

Enabling Archiving and Setting Scan Frequency

Archiving is disabled by default.
To enable the feature on a database, and to set the frequency at which RavenDB scans the database for documents scheduled for archiving, pass the ConfigureDataArchivalOperation operation a DataArchivalConfiguration instance.

var configuration = new DataArchivalConfiguration
{
    // Enable archiving
    Disabled = false,
    // Scan for documents scheduled for archiving every 180 seconds 
    ArchiveFrequencyInSec = 180
};

var result = await store.Maintenance.SendAsync(
                    new ConfigureDataArchivalOperation(configuration));

| Parameter | Type | Description |
| --- | --- | --- |
| Disabled | bool | If set to true, archival is disabled for the entire database. Default: true |
| ArchiveFrequencyInSec | long | Frequency (in seconds) at which the server checks for documents that need to be archived. Default: 60 |
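
To make the effect of ArchiveFrequencyInSec concrete, the sketch below estimates which scan would pick up a document scheduled for a given time. It is a hypothetical, idealized model (ArchivalTiming is not part of RavenDB): it assumes scans start at a known time and repeat at exact multiples of the interval, whereas the actual scan times are decided by the server.

```csharp
using System;

// Hypothetical model of the scan schedule; real scan times are decided by the server.
public static class ArchivalTiming
{
    // Returns the first scan that runs at or after the scheduled archival time,
    // assuming scans run at firstScanUtc, firstScanUtc + frequency, and so on.
    public static DateTime NextScanAtOrAfter(
        DateTime firstScanUtc, long frequencyInSec, DateTime archiveAtUtc)
    {
        if (frequencyInSec <= 0)
            throw new ArgumentOutOfRangeException(nameof(frequencyInSec));

        // Already due when the first scan runs
        if (archiveAtUtc <= firstScanUtc)
            return firstScanUtc;

        long periodTicks = TimeSpan.FromSeconds(frequencyInSec).Ticks;
        long elapsedTicks = (archiveAtUtc - firstScanUtc).Ticks;

        // Ceiling division: number of whole periods needed to reach archiveAtUtc
        long periods = (elapsedTicks + periodTicks - 1) / periodTicks;
        return firstScanUtc.AddTicks(periods * periodTicks);
    }
}
```

Under this model, with scans every 180 seconds starting at 12:00:00, a document scheduled for 12:04:00 is handled by the 12:06:00 scan, so the delay between the scheduled time and the actual archival stays below one scan interval.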

Read here about setting document archival using Studio.

Default (Server/Database) Configuration Options

RavenDB features that currently include built-in support for archived documents are:

  • Auto indexing
  • Static indexing
  • Data subscription

These features can be configured to exclude, include, or handle only archived documents they encounter, using the ArchivedDataProcessingBehavior enum (see below).

The archiving policy is applied to an index or a data subscription task when the index or task is created. Changing the default configuration therefore does not change the behavior of an existing index or data subscription task.

ArchivedDataProcessingBehavior:

public enum ArchivedDataProcessingBehavior
{
    // Exclude archived documents: avoid indexing them or sending them to workers  
    ExcludeArchived,
    // Include archived documents: index them or send them to workers
    IncludeArchived,
    // Handle ONLY archived documents: index or send to workers only archived documents  
    ArchivedOnly
}

  • Server and database configuration options allow you to apply a default behavior toward archived documents across a single database or all databases.
  • The same behavior can be set in a specific index or data subscription definition, overriding the default.
  • An index definition can also read a document's metadata and apply additional logic according to its archival status.
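
The meaning of the three enum values can be summarized as a predicate over a document's archival status. The sketch below is a conceptual model only, not RavenDB's implementation: ArchivalFilter is a hypothetical name, and the enum is redeclared locally so that the snippet compiles on its own (client code would use Raven.Client.Documents.DataArchival.ArchivedDataProcessingBehavior instead).

```csharp
using System;

// Local copy of the enum shown above, so the sketch compiles on its own.
public enum ArchivedDataProcessingBehavior
{
    ExcludeArchived,
    IncludeArchived,
    ArchivedOnly
}

// Hypothetical model, not RavenDB's implementation.
public static class ArchivalFilter
{
    // Decides whether an index or a subscription with the given behavior
    // would handle a document, based only on whether the document is archived.
    public static bool ShouldProcess(ArchivedDataProcessingBehavior behavior, bool isArchived) =>
        behavior switch
        {
            ArchivedDataProcessingBehavior.ExcludeArchived => !isArchived,
            ArchivedDataProcessingBehavior.IncludeArchived => true,
            ArchivedDataProcessingBehavior.ArchivedOnly => isArchived,
            _ => throw new ArgumentOutOfRangeException(nameof(behavior))
        };
}
```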

The following configuration options determine how RavenDB features handle archived documents.

Note that configuring the behavior of a specific index or data subscription task when they encounter archived documents will override the default settings presented here.


Indexing.Auto.ArchivedDataProcessingBehavior:

Use this option to set the behavior of auto indexes across the database when they encounter archived documents.


Indexing.Static.ArchivedDataProcessingBehavior:

Use this option to set the behavior of static indexes across the database when they encounter archived documents.


Subscriptions.ArchivedDataProcessingBehavior:

Use this option to set the behavior of data subscription tasks across the database when they encounter archived documents.
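
As a rough sketch, the three options can be gathered into a key/value collection before they are applied. The helper below is hypothetical (ArchivalDefaults is not part of the RavenDB client), and it assumes that each option takes the name of an ArchivedDataProcessingBehavior member as its value. How the settings are then applied (a settings file, Studio, or a client operation) is outside the scope of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the RavenDB client API.
public static class ArchivalDefaults
{
    // Assumption: option values are the enum member names.
    private static readonly HashSet<string> AllowedValues = new HashSet<string>
    {
        "ExcludeArchived", "IncludeArchived", "ArchivedOnly"
    };

    // Builds the configuration entries for auto indexes, static indexes,
    // and data subscriptions, rejecting unknown behavior names.
    public static Dictionary<string, string> Build(
        string autoIndexes, string staticIndexes, string subscriptions)
    {
        foreach (string value in new[] { autoIndexes, staticIndexes, subscriptions })
        {
            if (!AllowedValues.Contains(value))
                throw new ArgumentException($"Unknown behavior: {value}");
        }

        return new Dictionary<string, string>
        {
            ["Indexing.Auto.ArchivedDataProcessingBehavior"] = autoIndexes,
            ["Indexing.Static.ArchivedDataProcessingBehavior"] = staticIndexes,
            ["Subscriptions.ArchivedDataProcessingBehavior"] = subscriptions
        };
    }
}
```

Keep in mind that, as noted above, these defaults affect only indexes and data subscription tasks created after the configuration is changed.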