Saturday, September 20, 2014

Fragmentation and Linux + Android + OSS problem



People who know me know that for many, many, many years now I have been a linux and OSS zealot. 

Unfortunately, my support for this model is starting to waver in the business arena.  Linux is now starting to suffer from the "me too" syndrome that Android is suffering from.  The openness of the platform is allowing too many forks and too many projects to otherwise obfuscate products that deserve more spotlight, notoriety, fanfare, and adoption. 

It is time for some of these distributions and really the the community in general to it's own internal "audit" and say "is this project really necessary" and "is that project really necessary?"

Just because someone can write a program/app/script that can output "Hello World" does not necessarily qualify them to write anything else that should be used in production, let alone adopted on a wide spread basis.

I guess in the end the pendulum has to swing back to where it started.  Microsoft's model may be "barbaric" but it is a cleaner implementation than "which leaf does I pick from which branch from which tree?" 

Really?

Thursday, May 22, 2014

3 billion row table...

So, no lie, now, we literally have a 3 billion row table at work.

It is that large because when it was designed by the developers there was not data maintenance plan for it, despite my pleading for them to put controls in.

So, we have this 3 billion row table with no timestamps of any kind or any kind of quality identifier to denote what would get thrown away.

This morning, the devs decided to roll some changes into production which the DBA group has never gotten a chance to review.

They decided, wow, let's alter this three billion row table.

Uh, what?

So we stopped the deploy and decided since it was largely a "recycle bin" table we would move it to table_old and create a new blank table and then using some logic pull back only good data gradually.

Sounds like a great idea!  It was...

Actually it is, but how to do this.

We have a ton of crap in the table, "maybe" 1/2 of it is good.  It's all fragmented and twisted together.

I started selectively pulling the "good" data over in chunks of 5x25k.  It would work well for a while, then get slow.

The table got fragmented, because gradually, the more good data it moved, it would consolidate all the bad data at the end of the table.  It did this because we decided to move the data in reverse PK order, because it is most likely that the only "wanted" data would be newer data.  Since there is no timestamp of any kind on the rows, reverse PK order was the easiest way to say "hey this is probably newer."

Eventually, I realized that you could then delete a bunch of cruft from the end of the table after you have moved a good chunk of the valid data.  This worked well for a while.  Once it got into the good data area again, eventually the trimming of the bad data would slow down, and the good data migration would be faster again.

This ebb and flow continued for a while until I saw a pattern.  Anytime the migration or the trimming would take more than 12-13 seconds, the other one would go faster.

Solution for hands free, worry free automation?  An interleaved python script of course:


 while True:
  print "Running migrate task"
  runtime = mustRunOK("Move "filtered" X number of rows from fat table to slim table in desc pk order",0);
  print str(runtime) + " seconds to complete migrate task"
  if (runtime > 30):
    time.sleep(180);
  if (runtime >= 13):
     running = True
     print "Running trim task"
     while (running == True):
       runtime = mustRunOK("Delete bad X number of rows from fat table to trim the end of the line in desc pk order",0);
       print str(runtime) + " seconds to complete trim task"
       if (runtime > 30):
         time.sleep(180);
       if (runtime >= 12):
         running = False
       time.sleep(30);
  time.sleep(30);


Every time the migrate task takes > 13 seconds, it will start trimming discarded fat, bad data from the end.  Once the trim takes > 12 seconds it will switch back to migrating good data.

If at any point either task takes longer than 30 seconds it will sleep for 180 seconds as it assumes there is load on the DB and it should wait a little while.

Hope this helps ya.  It did me.








Wednesday, April 23, 2014

Protip: When bean-counting a 100 million+ row table,

Protip:  When bean-counting a 100 million+ row table,

It is easier to make a summary table then sort its results than have MySQL do this:

select item, count(*) as num_deletions from DeletedCrap group by item order by num_deletitions


Do this instead:

create table report
(
def bigint not null default 0,
deletions bigint not null default 0,
primary key (def_id)
) engine = innodb;

insert into report (select item,1 from db.DeletedCrap) on duplicate key update deletions=deletions+1;

Select * from report order by deletions desc limit 1000;

(I've found that one finishes... the other doesn't)

P.S.

The other fastest way is to use something like PHP where array keys can be dynamically created, and increment dynamic array bean counters as the data streams in from the DB.  The above way while faster still incurs transaction overhead the PHP way does not.

Saturday, April 12, 2014

Why don't you use more tools for MySQL/MariaDB?

This is a question I get asked a lot.

Rather, I also get heckled for it.  Even by people I hire.

For me, it kinda goes back to using a calculator in math class.

A lot of DBAs, developers rely on tools to get the job done, and that is fine. 

But what if you were in an environment where those tools were not allowed?  What if you were in an environment that was foreign to you, and you needed to help fix a problem straight away?

What if there was no google? Oh noes.

Tools are great, they make our lives easier.  But don't you want to know how MySQL is doing something, or why?

Friday was a perfect example.  One of our internal clients has circular replication setup between two machines.  One of the machines ran out of disks pace (Bad Shinken for not finding that in time.)  The VIP was setup to be a smart active-active VIP so after the fact we had one machine in the middle of some updates, and we had another machine that crashed, got fixed, started taking updates again, crashed, and had to be fixed.

You may be able to find a tool on the web that claims I can fix your data in 10 seconds or less.  Unless you have read it's source code, and it is extremely important data, do you want to trust that tool to it?

What if you knew how to use mysqlbinlog yourself export out the updates from both machines as best you can, give it a proper known start and stop point on each machine, a known state that you trust because you know what you are doing, apply the updates through the mysql cli, reset replication to the last exported and execution log position, and voila.

How do you figure out the start position for the mysqlbinlog export? By looking at the last executed log position from mysql's show slave status of course. :-)

The tool may have worked, but you took it on faith.  If you know how to do it with the tools MySQL gives you natively, there is nothing to take on faith.  You see it work right then.  More importantly doing it this way you can see when something fails or goes sideways during the repair (if it does) and you can react better to it, because you see the failure.  You don't have to troubleshoot where the failure occurred.  You see it happen, you know exactly what line.

For me, bottom line, it is great to use tools.  It is also great to know how to do it with what MySQL or Linux gives you out of the box.  

Auto Increment and InnoDB

Auto Increment is one of the neat pieces of functionality that MySQL has.  It is not implemented the way other database platforms do it.  Typically, a platform like T-SQL (Postgres, Microsoft) will create an internal UDF/SP that does work behind the scenes in a sequence table.

When MySQL initially created Auto Increment, they did so for MyISAM.  When it came time to implemented in a transactional storage engine, they changed the game a little.  It is both good and bad.

To get performance MySQL stores the Auto-Inc current value for a table in memory.  It doesn't actually flush the current value of the counter until some event forces MySQL to do so (a server shutdown/restart or flush perhaps.)  I have seen MySQL crash in the middle of operation and reset the counter back to the last flush and in weird circumstances start handing out the same IDs generating duplicate keys.

Further, because InnoDB is transactional, and the counter is cached in memory until a flush, this is one of the few times where when generating a new sequence will cause InnoDB to do a full table lock for just a split microsecond.

Microsecond you say? That's hardly a horrible amount of time to wait, and to a human that is true. To a computer thousands of updates per second, each microsecond can feel like a life time.  We don't want our databases feeling old ahead of their time.

I present to you...

The counter UDF/SP.

DELIMITER ;;
DROP FUNCTION IF EXISTS calc_id;;
CREATE FUNCTION calc_id(_n BIGINT) RETURNS bigint(20)
    DETERMINISTIC
BEGIN

declare n BIGINT;
declare x BIGINT;

set x = (floor(_n / @@auto_increment_increment)) * @@auto_increment_increment;

set x = x + IF( (@@auto_increment_offset = @@auto_increment_increment),0,@@auto_increment_offset);

if x<=_n then
  set x=x+@@auto_increment_increment;

end if ;
return x;
END
;;
DELIMITER ;



DELIMITER ;;
DROP FUNCTION IF EXISTS get_id;;
CREATE FUNCTION get_id(_seq_idx char(32)) RETURNS bigint(20)
    DETERMINISTIC
BEGIN
  UPDATE sequences SET seq_id= @_seq_id:= calc_id(seq_id) where seq_idx = _seq_idx;
  return @_seq_id;
END ;;
DELIMITER ;


DROP TABLE IF EXISTS sequences;
CREATE TABLE sequences (
  seq_idx char(32) NOT NULL,
  seq_id bigint(20) NOT NULL DEFAULT 0,
  PRIMARY KEY (seq_idx)
) ENGINE=InnoDB;


This little snippet of code has been tested for several years in multiple large scale environments.  I am sure there are ways to improve some of the internal logic, but when it can handle six digits of updates per second on a MySQL (ndb) cluster or five digits of updates per second on a single InnoDB master who cares?

Insert into sequences VALUES('users_pk_idx',0);
Select get_id('users_pk_idx');


This code carries some nifty side effects that eventually developers smile about.
  • To make sure it is happy in any replication environment, I always advise the developers to call the get_id() from code.  Then s/he will explicitly insert the PK value into the query.
  • This means the developer actually knows sequence value ahead of time.  They can plan other code and strategies around knowing the ID before it is inserted.
  • You no longer have to call mysql_insert_id() to get the last insert ID, because you already know what it is.
  • There is no more global locking of the table for microseconds to milliseconds (depending on how many tps you are having on the server)
  • It also follows the protocols for auto-increment ID skipping, so if you are in multi-master environments you can keep that.  
  • The auto-increment number becomes atomic and is as good and durable as to how you setup InnoDB to be. :-)
It is a little more work on the front end, but once you get used to it, you start to like the flexibility of it.  You can even maintain counters for fields that don't need to be PKs, or even UNIQUES explicitly.



Friday, April 11, 2014

An amusing introduction...

Well hello there.

Commonly, I like to tell people I play a DBA at work.  Apparently, I've done it so well I have everyone fooled into believing I am a good one. :-)

I've decided to start sharing the amusing, quirky, out-of-the-box thoughts I have on databases, architecture, technology.

Since you are still here, you will occasionally think I am nuts, or psycho in what I write.

Don't worry.  Anything I put here will have been tested and used in large scale production environments.

See you back here, soon...

Cheers,

DrSQL