Tagging tens of thousands of Drupal entities in under half a second
To be precise: I just added an entity reference item to 20,263 webform submissions in 449 milliseconds to verify that statistic.
This post is a technical but hopefully readable description of how you can add entity field items, en masse, without the need for a batch process, and how I arrived at that solution. But first, why did I need to figure that out in the first place?
Background
Since I left my day job and started building a Drupal-based SaaS company, I've been keeping the lights on by working with different clients to build and maintain Drupal applications. Recently, I had the opportunity to write a new module for IFundWomen (IFW), a mission-driven, for-profit organization that serves women entrepreneurs.
A primary line of their business is creating, receiving, and curating large numbers of grant applications, a process which ultimately leads to the award of much-needed capital to women-owned businesses. They've received tens of thousands of applications for their various grant opportunities over the years. Understandably, whittling down such large volumes of applications to a top-tier list of finalists is no small feat: curating these applications by hand takes many hours of effort. I was asked to implement a tool to accelerate that process.
Design considerations
One of the challenges of writing a tool for this task was that some applications are easy to disqualify programmatically: for example, was a necessary field left blank? Yet other qualifying factors are more subjective: for example, was the elevator pitch compelling? In other words, it's not entirely automatable. In fact, it's important to have a human-in-the-loop to ensure applications don't slip through the cracks of a rigid, programmatic solution. Therefore, the goal was to design a tool to accelerate, not replace, the process of discovering the most qualified candidates.
We call factors like the ones mentioned above *qualifications*, and while each qualification is assessed, potentially thousands of submissions need to be effortlessly disqualified. Here's an example: IFW's Universal grant application permits IFW to collect applications for multiple grants at once. A hypothetical toy manufacturing company would like to award a grant to parent entrepreneurs. The reviewers should be able to disqualify all submissions that didn't answer *Yes* to the *Do you have children?* question with no more than a few clicks.
The module I built (alongside Peter Weber, Kay Thayer, and Thanh Uong) provides a custom interface for curating these submissions. The workflow proceeds like so:
1. Create an Application by choosing a webform from which to draw submissions
2. Create a Qualification, which includes choosing pertinent form elements and a method of qualification, either manual or programmatic
   - If programmatic, a condition is configured, such as *[element] value is [at least] [number]*
3. A paginated table of submissions and the selected values appears
   - If the qualification is programmatic, the table is pre-filtered to show submissions which meet the qualification criteria; this can be toggled to show only those that do not
4. Select one or more submissions
   - If all are selected, a checkbox appears offering to apply the next step to all submissions, across all pages
5. Click the Disqualify button
6. Repeat steps 2–5 until only fully qualified submissions remain
7. Manually review and select finalists from this list of submissions
The seemingly minor checkbox that appears in step 4 is what motivated this post. It is a necessary convenience, but it creates a big problem: if selected, pressing the Disqualify button triggers a process that may need to update tens of thousands of entities, yet it still needs to feel fast to the user. That process therefore needs to be executed as efficiently as possible.
Technical details
A submission is disqualified when one of its entity reference fields (named `disqualifications`) references one of the Qualification entities created in step 2¹. Therefore, to disqualify an arbitrary number of submissions, a new entity reference item must be inserted into the entity reference field's table for each and every one of the selected submissions. This is essentially how Drupal tagging works; the only difference is that instead of referencing taxonomy term entities, submissions reference qualification entities.
Since Drupal's entity and field APIs do not afford methods for updating multiple entities at once, a custom solution was required. Conveniently, the value of the item to be inserted is the same for every submission², which makes a bulk approach feasible.
Drupal's Batch API is a great tool for processing large volumes of work that might exceed PHP's max execution time limit. My first iteration took this conventional approach. It worked, but slowly. Updating hundreds of entities took more than a minute to complete, which did not make for a pleasant user experience, and that experience quickly degraded as the number of entities to be updated grew arbitrarily large. Not only was a custom solution required, it needed to be an optimized one, too.
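For context, a conventional batch worker for this job might look something like the sketch below. The function name, the `$qualification_id` parameter, and the batch plumbing are illustrative assumptions rather than the module's actual code; the point is the shape of the work, one load and one save per submission.

```php
<?php

use Drupal\webform\Entity\WebformSubmission;

/**
 * Batch operation callback: disqualify one chunk of submissions.
 *
 * Hypothetical sketch. Every submission costs one entity load and one save,
 * so the total time grows linearly with the number of submissions selected.
 */
function mymodule_disqualify_submissions(array $submission_ids, string $qualification_id, array &$context): void {
  foreach (WebformSubmission::loadMultiple($submission_ids) as $submission) {
    // Append a new item to the entity reference field and persist it.
    $submission->get('disqualifications')->appendItem(['target_id' => $qualification_id]);
    $submission->save();
    $context['results'][] = $submission->id();
  }
}
```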
Restating the problem: updating thousands of entities is too slow, there are no APIs for updating them all at once, and the Batch API still isn't quick enough. What limits the Batch API? It's slow because every entity update requires two round trips to the database: one to fetch the entity data and one to save the new value. By loading entities in batches, the Batch API can only address the fetching half of the problem; it still must save them one by one. In other words, even in the best-case scenario, the Batch API can only reduce the work by 50%.
That left direct database interaction as the last, best option. With it, tens of thousands of rows can be updated in a single round trip. To understand the specifics of that solution, you'll need to understand how Drupal stores field data (no worries, it's summarized below).
The most practical drawback of interacting directly with the database is that it works entirely outside of Drupal's entity and field APIs. That means things like cache invalidation are left as an exercise for the reader. It also introduces a tight coupling to the field's database schema: if the schema changes upstream, your code may abruptly stop working or cause data integrity issues³.
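For instance, after writing field rows directly you would want to reset the entity storage cache and invalidate the affected cache tags yourself. A minimal sketch of that clean-up, assuming the same `$submission_ids` array used later in this post (this is not necessarily what the real module does):

```php
<?php

use Drupal\Core\Cache\Cache;

// Forget the stale copies of the updated submissions so the next
// entity load re-reads the new field values from the database.
$storage = \Drupal::entityTypeManager()->getStorage('webform_submission');
$storage->resetCache($submission_ids);

// Invalidate the per-entity cache tags so any rendered output that
// depends on these submissions gets rebuilt.
Cache::invalidateTags(array_map(
  static fn ($id) => "webform_submission:$id",
  $submission_ids,
));
```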
Diving down the stack
In the typical case, Drupal's field values are stored in unique database tables: one table per field, per entity type. For example, if you add a term reference field with the machine name `field_tags` to a content type, Drupal will create a database table named `node__field_tags`⁴. These field tables always have the following columns (described here with their schema comments):
- `bundle`: The field instance bundle to which this row belongs, used when deleting a field instance
- `entity_id`: The entity id this data is attached to
- `delta`: The sequence number for this data item, used for multi-value fields
- A few other columns not relevant to this post, such as `deleted` and `langcode`
These tables also have one or more columns specific to the field type they store. For example, entity reference fields have a `{field_name}_target_id` column which is used to store a reference to another entity. This is how taxonomy term references work, and it is how our module's `disqualifications` field works: if an application references a *Has website* qualification, it has been disqualified for not having a website.

Here is a representation of the `disqualifications` table with some example data:
| bundle | entity_id | delta | disqualifications_target_id | … |
|---|---|---|---|---|
| grant | 10 | 0 | is_incorporated | … |
| grant | 10 | 1 | has_website | … |
| grant | 42 | 0 | is_incorporated | … |
| … | … | … | … | … |
The table above shows two disqualified submissions to the `grant` webform, with IDs `10` and `42`. Submission `10` was disqualified for two reasons: the applicant hadn't incorporated a business yet and the application did not link to a website. Submission `42` was disqualified only because the applicant hadn't incorporated a business yet.
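To connect the table back to the field API: loading submission `10` and reading the field yields one value per row, keyed by `delta`. A rough illustration, assuming the simplified field name `disqualifications` used throughout this post:

```php
<?php

use Drupal\webform\Entity\WebformSubmission;

$submission = WebformSubmission::load(10);

// Roughly equivalent to the two rows for entity_id 10 in the table above:
// [
//   0 => ['target_id' => 'is_incorporated'],
//   1 => ['target_id' => 'has_website'],
// ]
$values = $submission->get('disqualifications')->getValue();
```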
Implementing a solution
Now, assume that both submissions must also be disqualified for a missing elevator pitch. Naïvely, one could execute the following query:
$connection = \Drupal::database();
$query = $connection->insert('field_table_name')
  ->fields([
    'bundle',
    'entity_id',
    'disqualifications_target_id',
  ]);
foreach ([10, 42] as $submission_id) {
  $query->values([
    'bundle' => 'grant',
    'entity_id' => $submission_id,
    'disqualifications_target_id' => 'has_elevator_pitch',
  ]);
}
$query->execute();
// Which compiles to:
// INSERT INTO field_table_name (bundle, entity_id, disqualifications_target_id)
// VALUES
//   ('grant', 10, 'has_elevator_pitch'),
//   ('grant', 42, 'has_elevator_pitch');
Unfortunately, this would not work, since the `delta` column value is required. But what delta should the query insert? `0` will fail because it will conflict with the existing values for both submissions, and `1` will conflict with submission `10`'s `has_website` reference. In this specific case, `2` would work, but it wouldn't work if another submission already referenced a third qualification. The `delta` value needs to be at least one more than the highest existing `delta` value for each of the entities.
Does this mean that we must query for each entity's highest delta and execute an insert for each one? While that would work, it would be very inefficient: the system would need to execute a read and a write for every entity, and that would take increasingly more time for every additional entity to be updated. It would not be much faster than loading, updating, and saving each entity one by one in a batch process.
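To make that concrete, here is a hedged sketch of the per-entity approach, with table and value names simplified as elsewhere in this post: one query to find the next free delta plus one insert, for every single submission.

```php
<?php

$connection = \Drupal::database();

foreach ($submission_ids as $submission_id) {
  // One round trip to find this submission's highest existing delta...
  $max_delta = $connection->select('webform_submission_disqualifications', 'd')
    ->fields('d', ['delta'])
    ->condition('d.entity_id', $submission_id)
    ->orderBy('d.delta', 'DESC')
    ->range(0, 1)
    ->execute()
    ->fetchField();

  // ...and another round trip to insert the new field item.
  $connection->insert('webform_submission_disqualifications')
    ->fields([
      'bundle' => 'grant',
      'entity_id' => $submission_id,
      'delta' => $max_delta === FALSE ? 0 : ((int) $max_delta) + 1,
      'disqualifications_target_id' => 'has_elevator_pitch',
    ])
    ->execute();
}
```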
My solution was to execute a single `SELECT` query using the `GROUP_CONCAT` aggregate function in order to construct a single `INSERT` query. It proceeds like so:

1. `SELECT` all rows from the field table; each row represents a single field item
2. `GROUP` the items by their `delta` value using the `GROUP_CONCAT` function (the grouped result is illustrated below)
3. Find all submission IDs to be disqualified that do not have a field item in the `0` bucket
4. Record those IDs in an array using `0` as the array index
5. Find all submission IDs to be disqualified that did have a field item in the `0` bucket, but do not have a field item in the `1` bucket
6. Record those IDs in an array using `1` as the array index
7. Repeat steps 5 and 6, incrementing the delta and index each iteration, until there are no submission IDs that have not been recorded in the array
Using the example data from above, the resulting array would be serialized like so:
[
  0 => [], // No submissions need a 0 delta (in other words, all entities have at least one item already)
  1 => [42], // This means: submission 42's next delta should be 1.
  2 => [10], // And submission 10's next delta should be 2.
]
The solution then iterates over the constructed array to build an `INSERT` query. First, it adds values for all the submissions that do not have any pre-existing references, using the index, `0`, as the delta. Next, it adds values for all submissions with a single existing reference, using the index `1` as the delta. And so on. Programmatically, the solution reads:
// Note: This code is simplified for clarity.
$connection = \Drupal::database();

// Group the existing field items by delta to learn which of the selected
// submissions already have an item at each delta.
$query = $connection->select('webform_submission_disqualifications', 'd')
  ->fields('d', ['delta'])
  ->condition('d.entity_id', $submission_ids, 'IN')
  ->groupBy('d.delta');
$query->addExpression('GROUP_CONCAT(d.entity_id)');
$result = array_map(
  fn (string $ids) => array_map('intval', explode(',', $ids)),
  $query->execute()->fetchAllKeyed()
);

// Bucket each submission ID under the next free delta for that submission.
$delta = 0;
$delta_ids = [];
while (!empty($submission_ids)) {
  $ids_with_delta = $result[$delta] ?? [];
  $delta_ids[$delta] = array_diff($submission_ids, $ids_with_delta);
  $submission_ids = $ids_with_delta;
  $delta++;
}

// Build a single INSERT query with one row per submission.
// (The real field table also requires columns such as revision_id and
// langcode; they are omitted here for brevity.)
$query = $connection->insert('webform_submission_disqualifications')
  ->fields([
    'bundle',
    'entity_id',
    'delta',
    'disqualifications_target_id',
  ]);
foreach ($delta_ids as $delta => $ids) {
  foreach ($ids as $id) {
    $query->values([
      'bundle' => 'grant',
      'entity_id' => $id,
      'delta' => $delta,
      'disqualifications_target_id' => 'has_elevator_pitch',
    ]);
  }
}
$query->execute();

// Using the example data, this compiles to:
// INSERT INTO webform_submission_disqualifications (bundle, entity_id, delta, disqualifications_target_id)
// VALUES
//   ('grant', 42, 1, 'has_elevator_pitch'),
//   ('grant', 10, 2, 'has_elevator_pitch');
As you may have already noticed, this technique does have its limitations, like the lack of translation and revision support or automatic cache invalidation⁵, but they can definitely be worth the tradeoff as long as they're carefully considered.
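One way to soften the coupling to the field's storage schema mentioned earlier is to ask the entity storage for the table and column names rather than hardcoding them. A sketch of that lookup, assuming the storage is SQL-based (as it is for webform submissions):

```php
<?php

$storage = \Drupal::entityTypeManager()->getStorage('webform_submission');
$table_mapping = $storage->getTableMapping();
$field_storage_definitions = \Drupal::service('entity_field.manager')
  ->getFieldStorageDefinitions('webform_submission');

// The dedicated table for the field, e.g. 'webform_submission__disqualifications'.
$table = $table_mapping->getFieldTableName('disqualifications');

// The type-specific column, e.g. 'disqualifications_target_id'.
$target_id_column = $table_mapping->getFieldColumnName(
  $field_storage_definitions['disqualifications'],
  'target_id'
);
```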
This technique helped provide an impressively quick user experience. But, perhaps best of all, working on this solution gave me a chance to learn some things about the Drupal field system along the way.
I hope you did too!
1. Actually, the field is named `submission_review_disqualifications`, since the custom module that provides the field is named `submission_review`. And that's why custom fields are always named `field_{something}`: they're provided by the `field` module! [return]
2. The technique described in this post will not work as-is if the field values must be different between entities, nor will it work as-is for revisioned or translated entities. [return]
3. If you do not own the field type, you should consider adding a unit test to ensure that the upstream schema doesn't change without you being alerted to that fact as soon as possible. [return]
4. This is not always true. Single-cardinality base fields are stored in columns on the entity type's field data table. For example, `node_field_data` stores the node's `type`, `status`, and `uid` field values, among others. [return]
5. These might be big limitations depending on your needs, but they aren't inherent to the problem; they're mostly due to the lack of an API in Drupal core itself. I think a new interface could be afforded and a more robust implementation could be added in a handler class like `SqlContentEntityStorage`. The blocker to that is finding the time, need, and expertise. [return]