Introduction
Always On Availability Groups (AGs/AG) provides a fairly convenient and reliable High-Availability Disaster Recovery (“HADR”) solution. Each database is configured to provide a fail-over solution using multiple servers running Windows Clustering replicas.
The two replicas – primary (supporting reads and writes) and secondary (a.k.a. read replica) are synchronized via your choice of a synchronous or an asynchronous process. Once AG Listener identifies unhealthy primary it will bring read replica as a primary on a different server.
This blog post is going to cover the making of SQL Server Agent (“Agent”) AG-aware. Some knowledge of cursors, sp_executesql and job tokens would be needed.
Problem
For both servers to host a primary database at any point in time, both servers would need to have identical database jobs. For example: any SSIS job that runs against ERP database on what is currently your primary server could suddenly swap to a new server. Unfortunately, Microsoft doesn’t support synchronizing of the msdb database (the database that stores all jobs related information) in AG.
As I previously mentioned, to support AG fail-over we would need identical jobs on both servers. The problem is, how do you support identical jobs on both servers without some of the jobs failing? The same job that writes data into a primary database will fail trying to write into a secondary replica (read-only).
The easiest option of course is not have jobs on both servers and only copy jobs from primary to secondary when fail-over occurred. This is of course a great idea on paper, but it creates a maintenance nightmare for the DBA to copy all the jobs within the allowed SLA timeframe (read: highly impractical).
Another option is to keep all jobs enabled on primary and disabled on secondary; run a script to disable/enable once the fail-over occurs. While this is a better option, it still needs a human to trigger all “enable/disable jobs” job and requires a very very delicate knowledge in your current environment’s jobs to know which jobs are part of AG, which jobs are local, and even which jobs are disabled by the user for whatever reason (read: babysitting needed) .
Solution
A comprehensive solution should take care of all existing jobs, support disabled jobs by user, and also support jobs that are designed to run on primary/secondary only.
How are we going to achieve it? We are going to rely on a job description (HADR.AA.PRIMARY – AG-aware . availability group name . replica role), and dynamically inject first step to stop the the job gracefully based on the job description. As a bonus, the job will remove the conditional stop first step from jobs that are no longer AG-aware based on the job description.
The following code will:
- Scan all jobs’ descriptions to identify HADR jobs
- Inject first step conditional stop based on the job description
- Remove first step conditional stop for jobs that have no HADR reference
-- =============================================
-- Author: Steve Rezhener
-- Create date: 10/01/2020
-- Description: Create/Remove HADR check for HADR
jobs
-- =============================================
-- EXEC [dbo].[usp_maintain_HADRcheck4HARDjobs]
CREATE PROCEDURE [dbo].[usp_maintain_HADRcheck4HARDjobs]
AS
BEGIN
SET NOCOUNT ON;
DECLARE @step_command AS NVARCHAR(MAX)
DECLARE @job_id AS UNIQUEIDENTIFIER
DECLARE @job_name AS VARCHAR(50)
-- add hadr check to all hadr-aware jobs
DECLARE agwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
SELECT jobs.job_id, jobs.name
FROM msdb.dbo.sysjobs as jobs
inner join msdb.dbo.sysjobsteps as steps on steps.job_id = jobs.job_id
WHERE 1=1
and jobs.description like '%HADR%' -- include only hadr aware jobs
and steps.step_name != 'HADR Check' -- exlcude jobs that have AGSteps1 step already
and steps.step_id = 1
OPEN agwarejobs
FETCH NEXT FROM agwarejobs INTO @job_id, @job_name
WHILE @@FETCH_STATUS = 0
BEGIN
-- HADR.AG1.PRIMARY
SET @step_command = N'' +
' DECLARE @ag_name AS VARCHAR(50), @replica_role AS VARCHAR(50)' +
' SELECT @ag_name = PARSENAME(description, 2), @replica_role = PARSENAME(description, 1)' +
' FROM msdb.dbo.sysjobs' +
' WHERE job_id = $(ESCAPE_NONE(JOBID))' +
' IF EXISTS (' +
' SELECT 1' +
' FROM sys.dm_hadr_availability_replica_states as hadr_avrest' +
' INNER JOIN sys.availability_groups as avgr ON avgr.group_id = hadr_avrest.group_id' +
' INNER JOIN sys.dm_hadr_availability_group_states as hadr_avgrst ON hadr_avgrst.group_id = hadr_avrest.group_id' +
' WHERE 1=1' +
' AND avgr.name = @ag_name ' +
' AND hadr_avrest.role_desc = @replica_role ' +
' AND avgr.name = @ag_name ' +
' AND hadr_avrest.role_desc = @replica_role ' +
' AND hadr_avrest.operational_state_desc = ''ONLINE''' +
' AND hadr_avrest.connected_state_desc = ''CONNECTED''' +
' AND hadr_avrest.synchronization_health_desc = ''HEALTHY''' +
')' +
' BEGIN' +
'' +
' PRINT ''stop the job''' +
' EXEC msdb.dbo.sp_stop_job @job_id = $(ESCAPE_NONE(JOBID));' +
'' +
' END' +
' ELSE PRINT ''continue with other steps'''
PRINT @step_command
IF EXISTS(SELECT name from msdb.dbo.sysjobs WHERE job_id = @job_id) -- just in case
BEGIN
PRINT 'add hadr check step to ' + @job_name + ' job'
EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id
, @step_id = 1
, @step_name = N'HADR Check'
, @subsystem = N'TSQL'
, @on_success_action = 3 -- Go to next step
, @command = @step_command
END
FETCH NEXT FROM agwarejobs INTO @job_id, @job_name
END
CLOSE agwarejobs
DEALLOCATE agwarejobs
--HADR.AA.PRIMARY
-- remove hadr check from all not hadr-aware jobs = left over due to job description change
DECLARE notagwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
SELECT jobs.job_id, jobs.name
FROM msdb.dbo.sysjobsteps as steps
LEFT OUTER JOIN msdb.dbo.sysjobs as jobs on steps.job_id = jobs.job_id
WHERE 1=1
and steps.step_name = 'HADR Check' -- include jobs that have AGSteps1 step = left over
and steps.step_id = 1
and ISNULL(jobs.description, '') NOT like '%HADR%' -- exclude jobs that are not hadr aware
OPEN notagwarejobs
FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'remove hard check step from ' + @job_name + ' job'
EXEC msdb.dbo.sp_delete_jobstep @job_id = @job_id
, @step_id = 1
FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name
END
CLOSE notagwarejobs
DEALLOCATE notagwarejobs
END
Disclaimer
This blog post is partially based on the following articles:
https://nigeldba.com/availability-groups-making-sql-agent-jobs-only-run-on-primary-replica
https://sqlundercover.com/2020/01/16/managing-agent-jobs-on-availability-group-servers/