Upgrading SQL Server in a Heavy OLTP Environment

In the trenches: how we upgraded the most important SQL Server for our publicly traded client, in a production environment where 24x7x365, 100% uptime is expected.

It’s the most important project of the year for the DBA team. Frankly, it’s the most important project for our team in years! Our client’s primary order-entry SQL Server must be upgraded to the latest and greatest SQL 2014! Yeah, I know, welcome to four years ago, but this is the highest we can go given the features we use heavily (replication) and what Microsoft will officially support.

We are one week away from our cutover “Go Live,” and we have been planning and making strides each day for the past 8 months. Now it’s one week away, and the stress is palpable! Review the plan again, really dive into the details: who’s doing what, and how. Double-check the scripts. What scripts still need to be developed? Yes, we are still developing scripts one week out, QC scripts mainly, but no pressure, right?

What it comes down to is how you are going to sync your new server and how to guarantee a perfect cutover. One plan: use SQL Server transactional replication to have the new databases online on the new server, set up all replication to report servers, DR, etc. in advance, and really eliminate any outage downtime. Sounds great, and management loves it; the DBAs in the trenches return blank stares when the plan is agreed upon. “We are warning you,” but OK, we are on board, let’s do our best to deliver. Yes, we have a backup plan, just in case:

  • Plan A: Data replication to sync the new server

  • Plan B: Log shipping to sync the new server

Plan A – yes, we are heavily invested in SQL replication, a natural go-to for syncing one server’s data with another. There is risk with this method: after a backup and restore of the primary server’s databases, you bring the databases online, set up replication, and the only objects syncing are tables and data. No stored procedures, views, functions, users, or logins; nothing outside of tables and data is kept in sync, and of course the table schema sync options are limited. Hence the blank stares from the DBA team when told Plan A must be delivered. Why consider Plan A at all? The databases on the new server are online well in advance, allowing for full index and statistics rebuilds and complete setup and configuration of replication to off-site DR servers (peer-to-peer), which is a major burden that causes blocking. Just try to get a full quiesce (no user data changes) in a 24x7 environment. With Plan A we can stop the distribution agents delivering changes and get a perfect quiesce, allowing a perfect, pain-free initialization of off-site DR.
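
For readers unfamiliar with the moving parts, standing up a transactional publication looks roughly like this; a minimal sketch, with database, publication, table, and server names all hypothetical:

```sql
-- Run on the publisher. All names here are made up for illustration.
USE OrderEntry;
GO
-- Enable the database for transactional publishing.
EXEC sp_replicationdboption
    @dbname  = N'OrderEntry',
    @optname = N'publish',
    @value   = N'true';

-- Create the publication.
EXEC sp_addpublication
    @publication = N'OrderEntry_Pub',
    @repl_freq   = N'continuous',
    @status      = N'active';

-- Add one table (article). Only tables and data flow this way;
-- procs, views, functions, users, and logins do NOT replicate.
EXEC sp_addarticle
    @publication   = N'OrderEntry_Pub',
    @article       = N'Orders',
    @source_object = N'Orders';

-- Push the data to the new server.
EXEC sp_addsubscription
    @publication       = N'OrderEntry_Pub',
    @subscriber        = N'NEWSQL01',
    @destination_db    = N'OrderEntry',
    @subscription_type = N'push';
```

Multiply that by thousands of tables across 100 databases and the scale of Plan A becomes clear.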

Day one of enabling replication, we chose only 3 databases (of 100) in Production, and hit immediate issues. The next business day, replication suffered 6 hours of latency on a low-volume day. Now a look of “I told you so” from the DBAs. Team lead: “let’s try one thing”… and five troubleshooting attempts later… it’s not gonna work. Let’s move past this. No, we are NOT calling Microsoft to troubleshoot further; we are two weeks from Go Live, and it is safe to say replication will not work.

Plan B it is. Log shipping: tried and true backup/restore technology, perfect execution, bullet-proof, a perfect cutover for all database objects and data. The only downside: we have to tell management the system will be down for an hour at best; realistically, expect 1.5 hours.

One thing about native SQL Server log shipping: it is managed by executables, DLLs, and extended stored procedures outside of SQL Server (no one bothered to really investigate and understand; it is what it is). We discovered it is heavy on disk when the backups kick off, even if staggered, and causes disk contention and application slowness. So we developed custom log shipping. Easy, right? “Ship-Logging” we call it; gotta have a sense of humor and a laugh sometimes (think about what rhymes with ship). DBA humor, I know.
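
At its core, what Ship-Logging boils down to is simple; a sketch of the cycle, with database name and share path made up:

```sql
-- On the primary: back up the transaction log (hypothetical path/db).
BACKUP LOG OrderEntry
    TO DISK = N'\\shiplog\OrderEntry\OrderEntry_223000.trn'
    WITH INIT, COMPRESSION;

-- On the secondary: restore it, leaving the database able to accept
-- the next log in the chain.
RESTORE LOG OrderEntry
    FROM DISK = N'\\shiplog\OrderEntry\OrderEntry_223000.trn'
    WITH NORECOVERY;

-- At cutover: after the final log backup/restore, bring the db online.
RESTORE DATABASE OrderEntry WITH RECOVERY;
```

Rolling your own means the backup schedule, staggering, and file movement are entirely under your control, which was the whole point.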

Checklists! Prep your servers and run through your checklists; you have to have checklists:

  • Memory settings
  • Is hyperthreading enabled?
  • Trace flags for master and tempdb
  • Are logins created, with server-level security roles assigned?
  • Is DB Mail for alerting working?
  • Is the sa account disabled?

Checklists are important to maintain and improve with each project that involves bringing servers into production. Different server types have different checklists. For our replication distributor, the #1 setting is the desktop heap. Google it; don’t get bitten by this one.
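
Several of those checklist items can be verified straight from a query window. A sketch of a few checks (the Database Mail profile name and recipient are hypothetical):

```sql
-- Memory settings: what is max server memory set to?
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';

-- Which trace flags are enabled globally?
DBCC TRACESTATUS(-1);

-- Is the sa account disabled?
SELECT name, is_disabled
FROM sys.server_principals
WHERE name = N'sa';

-- Is Database Mail able to send an alert?
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'DBA_Alerts',            -- hypothetical profile
    @recipients   = N'dba-team@example.com',  -- hypothetical address
    @subject      = N'Checklist test mail';
```

Scripting the checks means the checklist runs the same way on every new server, every time.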

Dry Runs! Final preparation meeting as we are 5 days out: we are ready for what we call a “dry run.” We need to be precise on how long our restores will take and relay accurate downtime to the business, and we can learn and get familiar with the process. In a dry run we bring all databases online and run through our upgrade scripts, replication build scripts, and login- and job-enable scripts, and really measure how things flow. Dry run #1, we learned a lot!

  • Final tran log backups fly by when calling the backup command database by database; restores, however, take time. We have jobs for the restores, so let’s kick them all off one by one: the response to a job start is immediate, and all of them run at the same time. This alone saved an easy 15 minutes!
  • Databases missing – really?! Yup, we found 2 used only by DBAs. Not the end of the world, but wow. OK, fixed.
  • Watching DBAs share their screens while running their parts showed they were not comfortable with their scripts: ordering was off, and changes were made on the fly. Imagine if we had waited for Go-Live to run them for the first time. An easy hour shaved off downtime here.
  • GO was missing between the statements that align users to logins. As our luck would have it, the DBA sharing their screen was not the only one who didn’t understand how to use the script generator; 3 others had the same issue. Discovering and solving the problem here saved permission errors, which are always hard to troubleshoot!
  • Encryption drop-key statements were mixed in with the creates. This is an easy fix for next time: separate drops from creates, run scripts with F5, bamm, no picking through line by line.
  • Replication scripts failed because subscribers were being subscribed to from our new publisher for the first time; with thousands of tables, this would have been a disaster on Go-Live. End result: the replication scripts cannot be tested, so abandon dry run #1. Easy fix: add a dummy publication and article, manually add a subscription once to every possible subscriber, done. Now ready for dry run #2!
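
The restore-job trick from the first bullet works because `sp_start_job` returns as soon as the job is queued, rather than waiting for it to finish. A sketch of kicking off every restore in parallel (the job naming convention is hypothetical):

```sql
-- Start every restore job without waiting for any of them to complete.
DECLARE @job sysname;
DECLARE jobs CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM msdb.dbo.sysjobs
    WHERE name LIKE N'Restore - %';   -- hypothetical naming convention
OPEN jobs;
FETCH NEXT FROM jobs INTO @job;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_start_job @job_name = @job;  -- returns immediately
    FETCH NEXT FROM jobs INTO @job;
END
CLOSE jobs;
DEALLOCATE jobs;
```

All 100 restores then run concurrently under the SQL Agent, which is where the 15 minutes came from.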

OK – great lessons learned, and we need another dry run. It’s Tuesday and Go-Live is Saturday; how fast can our after-hours DBAs rebuild Ship-Logging? Bamm, like money in the bank, 100 dbs back in Ship-Logging by our Wednesday morning! What a luxury having DBAs fresh in another timezone to get major work done while the US team sleeps!

Here we go – Dry Run #2. Yup, more learned:

  • Some changes from dry run #1 were not understood, with DBAs asking, “Do we run update-stats now or later or when, again?”
  • Accounting archiving was running – what? Add to notes, and another one for the checklist: disable all archiving, purging, and maintenance jobs.
  • Last-minute change: our leader wants a remote DBA to run his scripts. Immediately, squawks over the phone speaker: “having trouble – can’t connect – someone check Cisco..” OK, great find here and an easy correction: NO ONE can be remote and running scripts for Go Live, period. Everybody running cutover scripts must be on-site. Period. But there was more confusion from this:
    • Tracey shouts- I’ve got Kathy’s script since she can’t connect
    • Sandy – “I’ve got Kathy’s”
    • “Wait Sandy – you are supposed to be watching blocking.. I need you.”
    • Fred “I’ll watch blocking.”
      • Add to plan: stick with chore assignments! Fluidity like this is great, but let’s have each DBA man their battle station!
  • Fred discovers that rebuilding the PK index flies on a nearly 1-billion-row table (2 minutes), something we had not been able to do in production in a long time. Good find; add to the plan: rebuild the PK index on all BIG tables before letting users in!
  • Replication, now fully testable with the subscriptions added, is slow, very slow. From experience we plan to add publications and articles iteratively, pub by pub, with no blocking as the goal. So we test creating multiple pubs in parallel across 3 to 4 dbs at a time; sure enough there is blocking, but they all succeed. OK, this is a risk we take for Saturday: run 3 or 4 at a time to shave 20 minutes. We also hit blocking from something we had never seen before: an internal process running sp_sqlagent_refresh_job. Wth? Google it. We live with it, as we can’t find a job running this on the distributor or the publisher; it’s some internal IsAlive check for a publisher.
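
Fred’s pre-open rebuild from the list above, sketched out; table and index names are hypothetical, and the MAXDOP value is an assumption you would tune to the box:

```sql
-- Rebuild the clustered PK on a big table before letting users in.
ALTER INDEX PK_OrderDetail              -- hypothetical index name
    ON dbo.OrderDetail                  -- hypothetical table name
    REBUILD WITH (SORT_IN_TEMPDB = ON, MAXDOP = 8);

-- Follow up with fresh statistics on the same table.
UPDATE STATISTICS dbo.OrderDetail WITH FULLSCAN;
```

With zero users connected, there is no blocking to worry about, which is exactly why a 2-minute rebuild was suddenly possible.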

OK – 2 dry runs done and still problems, but nothing we can’t overcome. No time for a 3rd; we’ve got valuable intel, and we are all comfortable now with each other’s tasks. Nothing a good meeting or two can’t straighten out.

Meeting time: go over the plan again and question the timing and order of everything. Final question asked with all of us DBAs in a room: “Where are we going to fail?” Everyone dig deep: where are the weaknesses, what have we not thought about? Some extra QC on scripts we could not run; ask the server team about their steps: are you comfortable, what is your timeline, etc. To help the other teams help us, the DBA team can provide scripts to stop all SQL Server Agents on all SQL Servers, which ultimately stops any data inbound from our order-entry satellite servers. They agree. Good; we were afraid of a manual, one-by-one approach.

GO LIVE – it’s GO time. It’s Saturday and we go at 10:30 PM; sleeping in Saturday morning and a long nap Saturday afternoon are on tap! We have a DBA for prod support Saturday who will drop off when we come in for the afternoon, and a team in Europe that will take over when we crash and burn at 3 AM. All set.

It’s 9:50 PM. Management states only onesy-twoesy orders are coming in: “let’s go early, let’s go at 10!” OK, let’s go. Wait, where is everyone? Outside smoking, or chatting, or in the bathroom throwing up, who knows. Get them! A voice of reason says, “No, let’s not; we are introducing unnecessary stress and a rush to our battle stations. Stick to the plan.” Good call: stick to the plan!

10:30PM hits… GO time:

  • Run the disable-all-logins (except DBAs) script, Go!
  • Kill all connections to SQL Server, run script – Go! Wait, connections are not dying. Why? Check the script, check the connection. What is it doing? Rolling back. OK, good, let’s give it a couple of minutes. 10 minutes later… all clear.
  • Breathe. OK, run the PowerShell command to stop all SQL Server Agents. Go! Guys, the SQL Agent service is still running on OrderEntry #1, and OE #2. What’s going on? Dang, there was that assurance from the script author: “it’ll work fine, no worries.” Famous last words, right? Server team, go to manual mode: launch Cluster Administrator on all servers and stop the services. Another 10 minutes, 15… Done! Go, what’s next?
  • Take the final backup, restore the final backup, run the QC script to check for restore finish times: 90 databases done, 97 done, 99 done, all done! Next.
  • All 6 on-site DBAs run the “bring online” scripts for their group of dbs: upgrade compatibility to SQL 2014, dbcc updateusage, done! Next!
  • Over to the server team to change the IP address to that of the retired server; 1 minute allotted here, Go! 15 minutes later… “Who assigned 1 minute? Did you communicate this timeline to the server team? Was this realistic?” No biggie, things happen; this one is hard to time-test. IP switch done, DNS updated, old server rebooted. DBAs, go!
  • I can’t connect. Neither can I. Me neither. Connect by IP, not name; DNS is failing. Oh dang, we are on the Dev domain, not the Prod domain, and DNS trickle-down takes time. OK, connections made by IP, all good, GO!
  • Let QA Team in – Let all the users in – release the hounds!
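
The first two Go-time steps above can be sketched in T-SQL; a rough illustration, where the login filters (the DBA naming pattern in particular) are assumptions you would replace with your own list:

```sql
-- 1) Disable every enabled login except the DBA accounts.
DECLARE @sql nvarchar(max) = N'';
SELECT @sql += N'ALTER LOGIN ' + QUOTENAME(name) + N' DISABLE;' + CHAR(13)
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')        -- SQL, Windows, and group logins
  AND name NOT LIKE N'%DBA%'         -- hypothetical DBA naming pattern
  AND name NOT LIKE N'##%'           -- skip internal system logins
  AND is_disabled = 0;
EXEC sys.sp_executesql @sql;

-- 2) Kill remaining user connections. Sessions mid-transaction will
--    linger in ROLLBACK, which is the wait we hit on Go-Live night.
DECLARE @kill nvarchar(max) = N'';
SELECT @kill += N'KILL ' + CAST(session_id AS nvarchar(10)) + N';'
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
  AND session_id <> @@SPID;          -- do not kill our own session
EXEC sys.sp_executesql @kill;
```

The rollback behavior is worth rehearsing in a dry run: a KILL succeeds immediately, but the session does not disappear until its transaction finishes rolling back.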

It’s 12:30 AM; total downtime was almost 2 hours. Not a major disaster; we didn’t quite meet the goal we relayed to the business, but given the severity of the undertaking, everyone was happy. Had we not gone through our dry runs, downtime could have been 3-plus hours, with many more mistakes to troubleshoot!

Now note any and all issues as they are reported: some found by DBAs and monitoring tools before end users noticed, others not discovered until users reported them. By the end of the day we had worked through all the major issues; minor ones came in over the next week or two, but we noted them all and updated our project file and checklists. It’s all about being ready for the next big upgrade…

I hope this “in the trenches” account, so to speak, helps other teams and DBAs manage their next upgrade project. Learn from our mistakes and our successes, and your team too can be the heroes!