Introduction

Always On Availability Groups (AGs/AG) provides a fairly convenient and reliable High-Availability Disaster Recovery (“HADR”) solution. Each database is configured to provide a fail-over solution using multiple servers running Windows Clustering replicas.

The two replicas – primary (supporting reads and writes) and secondary (a.k.a. read replica) are synchronized via your choice of a synchronous or an asynchronous process. Once AG Listener identifies unhealthy primary it will bring read replica as a primary on a different server.

This blog post is going to cover the making of SQL Server Agent (“Agent”) AG-aware. Some knowledge of cursors, sp_executesql and job tokens would be needed.

Problem

For both servers to host a primary database at any point in time, both servers would need to have identical database jobs. For example: any SSIS job that runs against ERP database on what is currently your primary server could suddenly swap to a new server. Unfortunately, Microsoft doesn’t support synchronizing of the msdb database (the database that stores all jobs related information) in AG.

As I previously mentioned, to support AG fail-over we would need identical jobs on both servers. The problem is, how do you support identical jobs on both servers without some of the jobs failing? The same job that writes data into a primary database will fail trying to write into a secondary replica (read-only).

The easiest option of course is not have jobs on both servers and only copy jobs from primary to secondary when fail-over occurred. This is of course a great idea on paper, but it creates a maintenance nightmare for the DBA to copy all the jobs within the allowed SLA timeframe (read: highly impractical).

Another option is to keep all jobs enabled on primary and disabled on secondary; run a script to disable/enable once the fail-over occurs. While this is a better option, it still needs a human to trigger all “enable/disable jobs” job and requires a very very delicate knowledge in your current environment’s jobs to know which jobs are part of AG, which jobs are local, and even which jobs are disabled by the user for whatever reason (read: babysitting needed) .

Solution

A comprehensive solution should take care of all existing jobs, support disabled jobs by user, and also support jobs that are designed to run on primary/secondary only.

How are we going to achieve it? We are going to rely on a job description (HADR.AA.PRIMARY – AG-aware . availability group name . replica role), and dynamically inject first step to stop the the job gracefully based on the job description. As a bonus, the job will remove the conditional stop first step from jobs that are no longer AG-aware based on the job description.

The following code will:

  1. Scan all jobs’ descriptions to identify HADR jobs
  2. Inject first step conditional stop based on the job description
  3. Remove first step conditional stop for jobs that have no HADR reference
-- =============================================
-- Author:	Steve Rezhener
-- Create date: 10/01/2020
-- Description:	Create/Remove HADR check for HADR
 jobs
-- =============================================
-- EXEC [dbo].[usp_maintain_HADRcheck4HARDjobs]
CREATE PROCEDURE [dbo].[usp_maintain_HADRcheck4HARDjobs] 
AS
BEGIN
SET NOCOUNT ON;
	
DECLARE @step_command	AS NVARCHAR(MAX)
DECLARE @job_id			AS UNIQUEIDENTIFIER
DECLARE @job_name		AS VARCHAR(50)
	
-- add hadr check to all hadr-aware jobs
DECLARE agwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
SELECT	jobs.job_id, jobs.name
FROM	msdb.dbo.sysjobs as jobs
	inner join msdb.dbo.sysjobsteps as steps on steps.job_id = jobs.job_id 
WHERE	1=1
	and jobs.description like '%HADR%'				-- include only hadr aware jobs
	and steps.step_name != 'HADR Check'					-- exlcude jobs that have AGSteps1 step already
	and steps.step_id = 1 
	OPEN	agwarejobs
	FETCH NEXT  FROM agwarejobs  INTO @job_id, @job_name		
	WHILE @@FETCH_STATUS = 0
	BEGIN
	-- HADR.AG1.PRIMARY
	SET @step_command = N'' + 
	' DECLARE @ag_name AS VARCHAR(50),  @replica_role AS VARCHAR(50)' +
	' SELECT @ag_name = PARSENAME(description, 2), @replica_role = PARSENAME(description, 1)' +
	' FROM msdb.dbo.sysjobs' + 
	' WHERE job_id = $(ESCAPE_NONE(JOBID))' +
		' IF EXISTS (' +
		' SELECT 1' +
		' FROM 	sys.dm_hadr_availability_replica_states as hadr_avrest' +
	'		INNER JOIN sys.availability_groups as avgr						ON avgr.group_id = hadr_avrest.group_id' +
	'		INNER JOIN sys.dm_hadr_availability_group_states as hadr_avgrst	ON hadr_avgrst.group_id = hadr_avrest.group_id' +
	' WHERE	1=1' +
	' AND avgr.name = @ag_name ' +
	' AND hadr_avrest.role_desc = @replica_role ' +
	' AND avgr.name = @ag_name ' +
	' AND hadr_avrest.role_desc = @replica_role ' +
	' AND hadr_avrest.operational_state_desc = ''ONLINE''' +
	' AND hadr_avrest.connected_state_desc = ''CONNECTED''' +
	' AND hadr_avrest.synchronization_health_desc = ''HEALTHY''' +
		')' +
		' BEGIN' +
		'' +
		' PRINT ''stop the job''' +
		' EXEC msdb.dbo.sp_stop_job @job_id = $(ESCAPE_NONE(JOBID));' +
		'' +
		' END' +
		' ELSE PRINT ''continue with other steps'''
		PRINT @step_command
		IF EXISTS(SELECT name from msdb.dbo.sysjobs WHERE job_id = @job_id) -- just in case
		BEGIN
			PRINT 'add hadr check step to ' + @job_name + ' job'
			EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id
			, @step_id = 1
			, @step_name = N'HADR Check'
			, @subsystem = N'TSQL'
			, @on_success_action = 3 -- Go to next step
			, @command = @step_command
		
		END
		FETCH NEXT  FROM agwarejobs  INTO @job_id, @job_name	
	END
 
	CLOSE agwarejobs
	DEALLOCATE agwarejobs
	
	--HADR.AA.PRIMARY
	-- remove hadr check from all not hadr-aware jobs = left over due to job description change
	DECLARE notagwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR
	SELECT	jobs.job_id, jobs.name
	FROM	msdb.dbo.sysjobsteps as steps 
		LEFT OUTER JOIN msdb.dbo.sysjobs as jobs on steps.job_id = jobs.job_id 
	WHERE	1=1
		and steps.step_name = 'HADR Check'							-- include jobs that have AGSteps1 step = left over
		and steps.step_id = 1 
		and ISNULL(jobs.description, '') NOT like '%HADR%'			-- exclude jobs that are not hadr aware
	OPEN	notagwarejobs
	FETCH NEXT  FROM notagwarejobs  INTO @job_id, @job_name		
	WHILE @@FETCH_STATUS = 0
	BEGIN
	PRINT 'remove hard check step from ' + @job_name + ' job'
	EXEC msdb.dbo.sp_delete_jobstep @job_id = @job_id
						, @step_id = 1
		
	FETCH NEXT  FROM notagwarejobs  INTO @job_id, @job_name	
	END
 
	CLOSE notagwarejobs
	DEALLOCATE notagwarejobs
END

Disclaimer

This blog post is partially based on the following articles:

https://nigeldba.com/availability-groups-making-sql-agent-jobs-only-run-on-primary-replica

https://sqlundercover.com/2020/01/16/managing-agent-jobs-on-availability-group-servers/