How to Streamline Update Hooks Using the Batch API

Tuesday, December 10, 2019

Hook42 Team

When maintaining a Drupal 8 site in production, it’s often necessary to make changes within the site’s database, specifically when it comes to modifying settings or data that is not handled by Drupal’s YAML file-based configuration management API. Some common examples of settings or data not handled by the configuration management API include the following:

Field values of actual nodes;
Field settings that cannot be altered via the admin UI once those fields are populated with data;
Database schemas/data models;
Menu items;
Taxonomy terms;
and more...

It’s fairly common that the needs of clients may necessitate changes to such data and settings. When that is the case, Drupal’s Update API provides an efficient, well-documented method for making changes to the database layer. The basic methods of tapping into the Update API are either (1) via an implementation of the hook hook_update_N() within the MODULE_NAME.install file in a custom module, or (2) via an implementation of the hook hook_post_update_NAME() within a MODULE_NAME.post_update.php file in a custom module. In case you’re not familiar with writing update hooks in general, my previous blog post provides a good example of what one might look like.

Deciding Which Update Hook to Use

If you are researching Drupal’s Update API you may find that the majority of tutorials focus on implementations of hook_update_N() rather than hook_post_update_NAME(). There is one key difference between the two hooks that you need to consider when choosing which one to implement: hook_update_N() is primarily meant for making changes related to configuration and/or database schemas (i.e., structural updates to a site), as suggested by the Drupal.org documentation. For this reason hook_update_N() should not be used for CRUD operations to update entity data (e.g., the field values on specific nodes), because “loading, saving, or performing any other CRUD operation on an entity is never safe to do (because these tasks rely on hooks and services)”. Instead, if you need to update data like entities, you should implement hook_post_update_NAME(), which Drupal runs in a fully loaded state with all of its APIs available, after any hook_update_N() implementations are run.
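To make the distinction concrete, here is a minimal sketch of the two hook styles side by side. The update number 8001, the slogan value, the 'article' bundle, and the 'field_subtitle' field are all hypothetical names chosen for illustration, and both functions only do anything useful when Drupal itself invokes them during a database update.

```php
<?php

/**
 * Structural/config change: this would live in my_module.install.
 *
 * The number 8001 and the slogan text are hypothetical examples.
 */
function my_module_update_8001() {
  // Editing simple configuration is the kind of task hook_update_N() suits.
  \Drupal::configFactory()
    ->getEditable('system.site')
    ->set('slogan', 'Updated by my_module_update_8001')
    ->save();
}

/**
 * Entity data change: this would live in my_module.post_update.php.
 *
 * The 'article' bundle and 'field_subtitle' field are hypothetical.
 */
function my_module_post_update_set_default_subtitle() {
  $storage = \Drupal::entityTypeManager()->getStorage('node');
  // Entity CRUD is acceptable here because post-update hooks run with
  // all of Drupal's APIs available.
  foreach ($storage->loadByProperties(['type' => 'article']) as $node) {
    $node->set('field_subtitle', 'Untitled')->save();
  }
}
```

Note that the second function loads every matching node at once, which is fine for a handful of nodes but is exactly the pattern that stops scaling once the node count grows.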

In the example below, we will be implementing hook_post_update_NAME() in order to programmatically update hundreds of node field values.

Pushing the Boundaries of your Update Hook

More often than not, you’re probably used to implementing update hooks to achieve relatively simple and straightforward tasks, such as programmatically updating a field value across a handful of nodes. However, because Drupal handles large volumes of content well, sometimes you’ll need to update far more than a handful of nodes: hundreds, if not thousands, of entities all at once.

Technically, you could write your update hook in such a way that those thousands of nodes are updated at one time, one immediately after the other. While that might work fine locally in testing, when you don’t care as much about performance, that style of memory-intensive or long-running update could pose a serious threat to the stability and performance of your site during deployment.

Instead, when it comes to implementing such memory-intensive update hooks, it’s best to reach for Drupal’s Batch API. Luckily, Drupal has made this relatively simple to do when implementing update hooks specifically. In the documentation for hook_update_N() and hook_post_update_NAME(), batch updates are explicitly mentioned as a best practice “if running your update all at once could possibly cause PHP to time out.” As stated in that same documentation, “in this case, your update function acts as an implementation of callback_batch_operation(), and $sandbox acts as the batch context parameter.” To see what this means, it’s probably easiest to look at an example.
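As a rough illustration of the contract described in that documentation, the following standalone sketch imitates how an update function is called repeatedly with the same $sandbox until it reports completion. The runner function here is a simplified stand-in written for this post, not Drupal’s actual implementation, and all function names are hypothetical.

```php
<?php

/**
 * Hypothetical update callback: processes two items per call.
 */
function example_post_update_demo(array &$sandbox) {
  // First call only: gather the work and initialize the tracking variables.
  if (!isset($sandbox['progress'])) {
    $sandbox['items'] = ['a', 'b', 'c', 'd', 'e'];
    $sandbox['max'] = count($sandbox['items']);
    $sandbox['progress'] = 0;
    $sandbox['processed'] = [];
  }

  // Every call: handle the next slice of at most two items.
  $chunk = array_slice($sandbox['items'], $sandbox['progress'], 2);
  foreach ($chunk as $item) {
    $sandbox['processed'][] = strtoupper($item);
  }

  // Report progress as a fraction; 1 means the work is complete.
  $sandbox['progress'] += count($chunk);
  $sandbox['#finished'] = $sandbox['progress'] / $sandbox['max'];
}

/**
 * Simplified stand-in for the update runner (NOT Drupal's real code).
 */
function example_run_until_finished(callable $callback): array {
  $sandbox = [];
  $calls = 0;
  do {
    // The same $sandbox array is handed back in on every call.
    $callback($sandbox);
    $calls++;
    // A callback that never sets '#finished' is treated as done.
    $finished = $sandbox['#finished'] ?? 1;
  } while ($finished < 1);
  return ['calls' => $calls, 'sandbox' => $sandbox];
}

$result = example_run_until_finished('example_post_update_demo');
print $result['calls'] . ' calls, ' . implode(',', $result['sandbox']['processed']) . "\n";
// Prints: 3 calls, A,B,C,D,E
```

In a real site, each of those calls happens in a separate batch pass, which is what keeps any single request short.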

Batch API: the better way to process thousands of node updates in one go

The following code is an update hook that I recently had to implement. Within this update hook, the code comments will describe how I was able to use Drupal’s Batch API within the hook, in order to ensure that I could process hundreds of node updates without significantly affecting the live production site’s performance.

For a bit of background context, this particular site had two content types referred to here as FOO and BAR. Any given BAR node was meant to exist only as a child in relation to a parent FOO node (within a one-way, BAR-to-FOO relationship), and every FOO node should have a child BAR node at all times. However, the problem was that, due to an initial architectural decision, not all FOO nodes had a child BAR node at all times, because the creation of the child BAR node was only triggered by a specific action taken by a user. Therefore, we needed to write an update hook to create a child BAR node for any and all FOO nodes that were currently missing a referring child BAR node.

In summary: for every FOO node, make sure there is also a BAR node. Currently, many FOO nodes do not have a corresponding BAR node. Let’s begin!

/**
 * @file
 * docroot/modules/custom/my_module/my_module.post_update.php
 */

/**
 * Creates a BAR node for any FOO node that does
 * not yet have a corresponding BAR node.
 */
function my_module_post_update_create_missing_nodes(&$sandbox) {

  // On the first run, we gather all of our initial
  // data as well as initialize all of our sandbox variables to be used in
  // managing the future batch requests.
  if (!isset($sandbox['progress'])) {

    /**
     * To set up our batch process, we need to collect all of the
     * necessary data during this initialization step, which will
     * only ever be run once.
     */

    /**
     * We start the batch by running some SELECT queries up front
     * as concisely as possible. The results of these expensive queries
     * are stored in $sandbox, which persists between calls, so we do
     * not have to look up this data again during each iteration of the batch.
     */
    /** @var \Drupal\Core\Database\Connection $database */
    $database = \Drupal::database();
    // Fetches the IDs of FOO nodes that a BAR node already references.
    $FOOs_with_BAR_refs = $database
      ->query("SELECT DISTINCT field_FOO_ref_target_id FROM {node__field_FOO_ref}")
      ->fetchCol();
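    // Expanding an empty array could produce an invalid "NOT IN ()" clause,
    // so fall back to 0, an ID that no node can have.
    if (empty($FOOs_with_BAR_refs)) {
      $FOOs_with_BAR_refs = [0];
    }
    // Fetches NIDs of FOO nodes without BAR node references.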
    $FOOs_without_BAR_refs = $database
      ->query(
        "SELECT nid FROM {node_field_data}
        WHERE type = :type
        AND status = :status
        AND nid NOT IN (:nids[])",
        [
          ':type' => 'FOO',
          ':status' => 1,
          ':nids[]' => $FOOs_with_BAR_refs,
        ]
      )
      ->fetchCol();

    /**
     * Now we initialize the sandbox variables.
     * These variables will persist across the Batch API’s subsequent calls
     * to our update hook, without us needing to make those initial
     * expensive SELECT queries above ever again.
     */

    // 'max' is the number of total records we’ll be processing.
    $sandbox['max'] = count($FOOs_without_BAR_refs);
    // If 'max' is empty, we have nothing to process.
    if (empty($sandbox['max'])) {
      $sandbox['#finished'] = 1;
      return;
    }

    // 'progress' will represent the current progress of our processing.
    $sandbox['progress'] = 0;

    // 'nodes_per_batch' is a custom amount that we’ll use to limit
    // how many nodes we’re processing in each batch.
    // This is a large part of how we limit expensive batch operations.
    $sandbox['nodes_per_batch'] = 18;

    // 'FOOs_without_BAR_refs' will store the node IDs of the FOO nodes
    // that we just queried for above during this initialization phase.
    $sandbox['FOOs_without_BAR_refs'] = $FOOs_without_BAR_refs;

  }

  // Initialization code done. The following code will always run:
  // both during the first run AND during any subsequent batches.

  // Now let’s create the missing BAR nodes.
  $node_storage = \Drupal::entityTypeManager()->getStorage('node');

  // Calculates current batch range.
  $range_end = $sandbox['progress'] + $sandbox['nodes_per_batch'];
  if ($range_end > $sandbox['max']) {
    $range_end = $sandbox['max'];
  }

  // Loop over current batch range, creating a new BAR node each time.
  for ($i = $sandbox['progress']; $i < $range_end; $i++) {
    $BAR_node = $node_storage->create([
      // NOTE: title will be set automatically via auto_entitylabel.
      'type' => 'BAR',
      'field_FOO_ref' => [
        'target_id' => $sandbox['FOOs_without_BAR_refs'][$i],
      ],
    ]);
    $BAR_node->status = 1;
    $BAR_node->enforceIsNew();
    $BAR_node->save();
  }

  // Update the batch variables to track our progress.

  // We can calculate our current progress as a fraction between 0 and 1.
  // Drupal’s Batch API keeps calling our update hook until
  // $sandbox['#finished'] reaches 1; any smaller value is read as the
  // fraction of work completed so far.
  $sandbox['progress'] = $range_end;
  $progress_fraction = $sandbox['progress'] / $sandbox['max'];
  $sandbox['#finished'] = empty($sandbox['max']) ? 1 : $progress_fraction;

  // While processing our batch requests, we can send a helpful message
  // to the command line, so developers can track the batch progress.
  if (function_exists('drush_print')) {
    drush_print('Progress: ' . (round($progress_fraction * 100)) . '% (' .
    $sandbox['progress'] . ' of ' . $sandbox['max'] . ' nodes processed)');
  }

  // That’s it!
  // The update hook and Batch API manage the rest of the process.

}
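To get a feel for how the batch size of 18 plays out, the sketch below replays only the range arithmetic from the hook above, with no Drupal calls involved. The figure of 100 missing nodes and the function name are hypothetical; the clamping and percentage logic mirror the $range_end and drush_print() lines in the hook.

```php
<?php

/**
 * Replays the hook's range arithmetic without touching Drupal.
 *
 * @return array
 *   One [start, end, percent] triple per batch pass.
 */
function example_batch_ranges(int $max, int $per_batch): array {
  $ranges = [];
  $progress = 0;
  while ($progress < $max) {
    // Clamp the end of the range so the last pass stops at $max.
    $range_end = min($progress + $per_batch, $max);
    $ranges[] = [$progress, $range_end, (int) round($range_end / $max * 100)];
    $progress = $range_end;
  }
  return $ranges;
}

// Hypothetical workload: 100 FOO nodes missing a BAR node, 18 per pass.
foreach (example_batch_ranges(100, 18) as [$start, $end, $percent]) {
  print "Pass covers indexes $start to " . ($end - 1) . " ($percent%)\n";
}
```

With these numbers the work is split into six passes, the last of which handles only the remaining ten nodes. A smaller batch size means more passes but less work per request, so the right value depends on how expensive each node save is on your site.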

Simple Solutions are Easier

As seen in the code above, Drupal’s Update API makes it relatively straightforward to quickly leverage its Batch API for memory-intensive and long-running update operations. With everything being managed by a simple $sandbox array of variables, we can tackle even the largest update operations on our sites with simple, yet efficient, batched processing.
