Thursday, May 22, 2014

3 billion row table...

So, no lie, now, we literally have a 3 billion row table at work.

It is that large because when it was designed by the developers there was not data maintenance plan for it, despite my pleading for them to put controls in.

So, we have this 3 billion row table with no timestamps of any kind or any kind of quality identifier to denote what would get thrown away.

This morning, the devs decided to roll some changes into production which the DBA group has never gotten a chance to review.

They decided, wow, let's alter this three billion row table.

Uh, what?

So we stopped the deploy and decided since it was largely a "recycle bin" table we would move it to table_old and create a new blank table and then using some logic pull back only good data gradually.

Sounds like a great idea!  It was...

Actually it is, but how to do this.

We have a ton of crap in the table, "maybe" 1/2 of it is good.  It's all fragmented and twisted together.

I started selectively pulling the "good" data over in chunks of 5x25k.  It would work well for a while, then get slow.

The table got fragmented, because gradually, the more good data it moved, it would consolidate all the bad data at the end of the table.  It did this because we decided to move the data in reverse PK order, because it is most likely that the only "wanted" data would be newer data.  Since there is no timestamp of any kind on the rows, reverse PK order was the easiest way to say "hey this is probably newer."

Eventually, I realized that you could then delete a bunch of cruft from the end of the table after you have moved a good chunk of the valid data.  This worked well for a while.  Once it got into the good data area again, eventually the trimming of the bad data would slow down, and the good data migration would be faster again.

This ebb and flow continued for a while until I saw a pattern.  Anytime the migration or the trimming would take more than 12-13 seconds, the other one would go faster.

Solution for hands free, worry free automation?  An interleaved python script of course:


 while True:
  print "Running migrate task"
  runtime = mustRunOK("Move "filtered" X number of rows from fat table to slim table in desc pk order",0);
  print str(runtime) + " seconds to complete migrate task"
  if (runtime > 30):
    time.sleep(180);
  if (runtime >= 13):
     running = True
     print "Running trim task"
     while (running == True):
       runtime = mustRunOK("Delete bad X number of rows from fat table to trim the end of the line in desc pk order",0);
       print str(runtime) + " seconds to complete trim task"
       if (runtime > 30):
         time.sleep(180);
       if (runtime >= 12):
         running = False
       time.sleep(30);
  time.sleep(30);


Every time the migrate task takes > 13 seconds, it will start trimming discarded fat, bad data from the end.  Once the trim takes > 12 seconds it will switch back to migrating good data.

If at any point either task takes longer than 30 seconds it will sleep for 180 seconds as it assumes there is load on the DB and it should wait a little while.

Hope this helps ya.  It did me.