Always On Availability Groups (AGs/AG) provides a fairly convenient and reliable High-Availability Disaster Recovery (“HADR”) solution. Each database is configured to provide a fail-over solution using multiple servers running Windows Clustering replicas.
The two replicas – primary (supporting reads and writes) and secondary (a.k.a. read replica) are synchronized via your choice of a synchronous or an asynchronous process. Once AG Listener identifies unhealthy primary it will bring read replica as a primary on a different server.
For both servers to host a primary database at any point in time, both servers would need to have identical database jobs. For example: any SSIS job that runs against ERP database on what is currently your primary server could suddenly swap to a new server. Unfortunately, Microsoft doesn’t support synchronizing of the msdb database (the database that stores all jobs related information) in AG.
As I previously mentioned, to support AG fail-over we would need identical jobs on both servers. The problem is, how do you support identical jobs on both servers without some of the jobs failing? The same job that writes data into a primary database will fail trying to write into a secondary replica (read-only).
The easiest option of course is not have jobs on both servers and only copy jobs from primary to secondary when fail-over occurred. This is of course a great idea on paper, but it creates a maintenance nightmare for the DBA to copy all the jobs within the allowed SLA timeframe (read: highly impractical).
Another option is to keep all jobs enabled on primary and disabled on secondary; run a script to disable/enable once the fail-over occurs. While this is a better option, it still needs a human to trigger all “enable/disable jobs” job and requires a very very delicate knowledge in your current environment’s jobs to know which jobs are part of AG, which jobs are local, and even which jobs are disabled by the user for whatever reason (read: babysitting needed) .
A comprehensive solution should take care of all existing jobs, support disabled jobs by user, and also support jobs that are designed to run on primary/secondary only.
How are we going to achieve it? We are going to rely on a job description (HADR.AA.PRIMARY – AG-aware . availability group name . replica role), and dynamically inject first step to stop the the job gracefully based on the job description. As a bonus, the job will remove the conditional stop first step from jobs that are no longer AG-aware based on the job description.
The following code will:
- Scan all jobs’ descriptions to identify HADR jobs
- Inject first step conditional stop based on the job description
- Remove first step conditional stop for jobs that have no HADR reference
-- ============================================= -- Author: Steve Rezhener -- Create date: 10/01/2020 -- Description: Create/Remove HADR check for HADR jobs -- ============================================= -- EXEC [dbo].[usp_maintain_HADRcheck4HARDjobs] CREATE PROCEDURE [dbo].[usp_maintain_HADRcheck4HARDjobs] AS BEGIN SET NOCOUNT ON; DECLARE @step_command AS NVARCHAR(MAX) DECLARE @job_id AS UNIQUEIDENTIFIER DECLARE @job_name AS VARCHAR(50) -- add hadr check to all hadr-aware jobs DECLARE agwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR SELECT jobs.job_id, jobs.name FROM msdb.dbo.sysjobs as jobs inner join msdb.dbo.sysjobsteps as steps on steps.job_id = jobs.job_id WHERE 1=1 and jobs.description like '%HADR%' -- include only hadr aware jobs and steps.step_name != 'HADR Check' -- exlcude jobs that have AGSteps1 step already and steps.step_id = 1 OPEN agwarejobs FETCH NEXT FROM agwarejobs INTO @job_id, @job_name WHILE @@FETCH_STATUS = 0 BEGIN -- HADR.AG1.PRIMARY SET @step_command = N'' + ' DECLARE @ag_name AS VARCHAR(50), @replica_role AS VARCHAR(50)' + ' SELECT @ag_name = PARSENAME(description, 2), @replica_role = PARSENAME(description, 1)' + ' FROM msdb.dbo.sysjobs' + ' WHERE job_id = $(ESCAPE_NONE(JOBID))' + ' IF EXISTS (' + ' SELECT 1' + ' FROM sys.dm_hadr_availability_replica_states as hadr_avrest' + ' INNER JOIN sys.availability_groups as avgr ON avgr.group_id = hadr_avrest.group_id' + ' INNER JOIN sys.dm_hadr_availability_group_states as hadr_avgrst ON hadr_avgrst.group_id = hadr_avrest.group_id' + ' WHERE 1=1' + ' AND avgr.name = @ag_name ' + ' AND hadr_avrest.role_desc = @replica_role ' + ' AND avgr.name = @ag_name ' + ' AND hadr_avrest.role_desc = @replica_role ' + ' AND hadr_avrest.operational_state_desc = ''ONLINE''' + ' AND hadr_avrest.connected_state_desc = ''CONNECTED''' + ' AND hadr_avrest.synchronization_health_desc = ''HEALTHY''' + ')' + ' BEGIN' + '' + ' PRINT ''stop the job''' + ' EXEC msdb.dbo.sp_stop_job @job_id = $(ESCAPE_NONE(JOBID));' + '' + ' END' + ' ELSE PRINT ''continue with other steps''' PRINT @step_command IF EXISTS(SELECT name from msdb.dbo.sysjobs WHERE job_id = @job_id) -- just in case BEGIN PRINT 'add hadr check step to ' + @job_name + ' job' EXEC msdb.dbo.sp_add_jobstep @job_id = @job_id , @step_id = 1 , @step_name = N'HADR Check' , @subsystem = N'TSQL' , @on_success_action = 3 -- Go to next step , @command = @step_command END FETCH NEXT FROM agwarejobs INTO @job_id, @job_name END CLOSE agwarejobs DEALLOCATE agwarejobs --HADR.AA.PRIMARY -- remove hadr check from all not hadr-aware jobs = left over due to job description change DECLARE notagwarejobs LOCAL STATIC READ_ONLY FORWARD_ONLY CURSOR FOR SELECT jobs.job_id, jobs.name FROM msdb.dbo.sysjobsteps as steps LEFT OUTER JOIN msdb.dbo.sysjobs as jobs on steps.job_id = jobs.job_id WHERE 1=1 and steps.step_name = 'HADR Check' -- include jobs that have AGSteps1 step = left over and steps.step_id = 1 and ISNULL(jobs.description, '') NOT like '%HADR%' -- exclude jobs that are not hadr aware OPEN notagwarejobs FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name WHILE @@FETCH_STATUS = 0 BEGIN PRINT 'remove hard check step from ' + @job_name + ' job' EXEC msdb.dbo.sp_delete_jobstep @job_id = @job_id , @step_id = 1 FETCH NEXT FROM notagwarejobs INTO @job_id, @job_name END CLOSE notagwarejobs DEALLOCATE notagwarejobs END
This blog post is partially based on the following articles: