BennyEast.Com/Blog The official blog of Kenny West

17 Sep 2015

Swiss Cheese Model

This isn't a blog post about models from Switzerland that walk with slices of cheese down a catwalk/runway while people snap photos and speak in French accents about how dashing the cheese looks at this year's cheese fashion show.

No, that would be a Swiss, cheese model.

This blog post is on the topic of... The Swiss Cheese Model.

 

[Image: Swiss cheese model of accident causation]
"Swiss cheese model of accident causation" by Davidmack - Own work. Licensed under CC BY-SA 3.0 via Commons.

The other day I'm on a run listening to some educational content and the person giving this particular talk mentions something called the Swiss Cheese Model of accident causation.  I'd never heard of this before!  I geek out about learning new things.  Give me more!

Immediately I'm like that's so fascinating.  So, after the run I go to the local coffee shop and I read up more about it.  Here's a couple of links with some more information...

https://en.wikipedia.org/wiki/Swiss_cheese_model

http://patientsafetyed.duhs.duke.edu/module_e/swiss_cheese.html

Basically the idea is that when things go wrong, when big things fail, like a project at a workplace (which was what this guy was mostly discussing), it's not because of one thing.  It's lots of little things.  The guy in the podcast talks about failure in general... anything from large catastrophes to workplace projects not succeeding...  He started his talk with airplane incidents and accidents, and how when investigators dissect what went wrong, they usually find that it wasn't just one thing that failed.  It was a whole chain of events leading up to the accident.

It's lots of little things that all just happened to line up to fall through all these different layers of holes in a bunch of swiss cheese slices.

It's really neat stuff.  At least I think it is.  It's a good model for building layers of accident prevention, so that one small failure doesn't turn into a big one.

The idea is that everything has flaws and errors.  People make mistakes, systems have defects, software has bugs... nothing is ever going to be perfect.  So failure is going to happen in at least the smallest aspect.  But trying to create layers or checkpoints so that a whole entire chain of failures doesn't happen all in a row can be a good way to prevent large scale failures.

So I start thinking about how I can apply this model at my own workplace, to the way our department is run.  My entire job is pretty much failure.  People don't usually come to the helpdesk in IT when everything is working great.  They come when things go wrong.  And the failures we get can be large or small.  Sometimes you get a single computer that crashes, or a printer that goes out to lunch, but other times you get things that affect LOTS of people and then you get a million messages all at once because a widely used service has gone offline.

One of the places where I use this type of model already actually, is in network design and security.  One of my job duties is maintaining, deploying, replacing network switching equipment.  There are very simple switches that connect up computer devices, but when you get to anything beyond a small business office, you need to start using managed switching devices and routers for more advanced security measures... etc. etc.

That stuff can get rather complicated.  Especially once your network gets into the thousands of devices.  All kinds of network nodes all linking up.  At each section of the network we have stop points.  We have protocols that are programmed into the ports that will cut service to that piece of equipment if something goes funky.  If a machine has a bad Network Interface Card... or gets infected with a virus and starts flooding the network... That port, or that switch, or that entire branch of the network will be cut off.  Without that stop point eventually the entire network would go down.

It's a very basic networking 101 thing.  If you take an ethernet cable and plug both ends into the same network, you create a loop, and traffic can circle around and multiply until the network becomes completely unusable.  Don't try this at your workplace because your IT admin will get very angry with you, especially if they don't have the proper protocols in place on the networking equipment to stop or cut service to the switch that connects your office to the rest of the network... but, let's say... you wanted to attempt to take things down for a bit so you could go home and nap for the afternoon, or go lounge out by the pool, instead of doing work...

Let's just say hypothetically... So ok, maybe you have two network jacks in your office, one for a printer and one for a computer.  Unplug both devices.  Then take one of those network cables and plug its two ends into the two wall jacks.  So now instead of going from the wall to the back of the computer, or from the wall to the back of the printer, it just goes from one wall port right over to the other wall port.

Then just wait...  Eventually those jacks should both go offline.  Depending on how it's all set up you may need to call IT to get them reactivated.  Or they may just come back to life on their own after a period of time.

Ours go offline indefinitely until we can fully investigate the situation and why that port shut down.
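Just to make that stop point concrete, here's a toy sketch in Python of a port that shuts itself off when it sees a flood of broadcast traffic and stays off until someone re-enables it by hand.  The class name, the threshold, and the traffic numbers are all made up for illustration.  Real switches do this with their own built-in loop and flood protection features, not with Python, so treat this as a picture of the idea and not how our equipment is actually configured.

```python
class GuardedPort:
    """Toy model of a switch port with a flood 'stop point' (hypothetical)."""

    def __init__(self, name, max_broadcasts_per_sec=1000):
        self.name = name
        # Made-up threshold; a real switch would have its own setting.
        self.max_broadcasts_per_sec = max_broadcasts_per_sec
        self.enabled = True

    def observe(self, broadcasts_per_sec):
        """Look at one second of traffic and shut the port if it is flooding."""
        if self.enabled and broadcasts_per_sec > self.max_broadcasts_per_sec:
            # The port stays down until someone investigates.
            self.enabled = False
        return self.enabled

    def admin_reenable(self):
        """Manual step: IT brings the port back after checking the cause."""
        self.enabled = True


port = GuardedPort("office-jack-1")
for rate in [20, 35, 50000, 10]:  # normal, normal, loop!, normal again
    port.observe(rate)

print(port.enabled)  # False: the port does not come back on its own
```

The one design choice worth noticing is that the port doesn't recover when traffic calms down.  That matches the "offline indefinitely until we investigate" approach; an auto-recovering port would just need a timer instead of admin_reenable().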

It's stupidly simple and small... but without it something as simple as plugging a network cable from one jack, right into another to create a loop can take an entire network offline.

I've seen where something can start super small and simple and then cascade to something large.  Like for example a very basic error is simply having a server run out of disk space.  Say a server... like a mail server... for some reason has some program or service that starts generating log files...  And maybe it's just creating little tiny log files very slowly... But let's say it all of a sudden starts creating a million log files overnight while everyone is home sleeping...

That morning all of a sudden there's no email flowing.  Without knowing that the mail server simply ran out of space...

It's anyone's guess what sort of issue is causing this.  But no email flowing could lead to missed appointments, or maybe there's something else going on, like the need to send out an emergency notification about a storm that is rolling in, telling everyone to move their cars from a lower parking lot that may flood...

Now all of a sudden some stupid log file that chewed up disk space that took down a mail server that made it so that you couldn't send out emails to everyone warning them to move their cars in the next couple hours...

Leads to a dozen cars being swept away in a river.

Of course this didn't actually happen... But it's the best example I could think of at the moment.  We have had log files go out of control though and stop services on servers.  But if you have proper monitoring applications that send out warnings when disk space is getting low, or redundant communication systems so people can still reach each other when one form of communication goes down, you add additional layers to the Swiss Cheese Model.
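A disk space warning doesn't have to be fancy.  Here's a minimal sketch in Python, assuming a 10% free space threshold that I picked out of the air and a function name (disk_space_low) that is just for this example.  A real monitoring application would also send an email or a text to somebody, preferably not through the mail server that's about to fill up.

```python
import shutil


def disk_space_low(path=".", min_free_fraction=0.10):
    """Return (is_low, free_fraction) for the disk that holds `path`.

    The 10% default is an arbitrary example threshold, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    return free_fraction < min_free_fraction, free_fraction


low, free = disk_space_low(".")
if low:
    print(f"WARNING: only {free:.0%} of the disk is free")
else:
    print(f"OK: {free:.0%} of the disk is free")
```

Run something like this on a schedule and the runaway log files get noticed while there's still room on the disk, which is one more slice of cheese between the log files and the parking lot.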

If you only have two layers until failure and those two holes line up coincidentally...  Failure.

But if you have five layers, or ten layers... it gets a lot less likely that all the holes line up.  Of course there's a cost to everything.  So it's about weighing the cost to purchase and implement hardware and software, the cost of labor to deploy and maintain things, and the cost to the user if added steps take away from workplace productivity.
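Here's a quick back-of-the-napkin way to see why more layers help, sketched in Python.  It assumes every layer has the same chance of having a hole in the wrong spot (I made up 10%) and that the layers fail independently of each other.  Real layers often don't fail independently, so the real numbers will be worse than this, but the trend is the point.

```python
def chance_all_layers_fail(hole_chance, layers):
    """Chance that every layer fails at once, assuming independent layers.

    hole_chance is the made-up probability that any single layer
    misses the problem.
    """
    return hole_chance ** layers


for layers in (2, 5, 10):
    chance = chance_all_layers_fail(0.10, layers)
    print(f"{layers} layers: about 1 in {round(1 / chance):,}")
```

Each extra layer multiplies the odds in your favor, but it also adds its own purchase, labor, and annoyance cost, which is the trade-off.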

How many people would shop with a credit card if every single time you swiped the card to buy something you had to do more than sign your name?  Fraud could drop to zero if they added enough steps, but who's going to wait for a text to be sent to your phone with a key code, then have the vendor call a credit card representative with that key code and price to verify the transaction, then open an app on your phone to see the verification pop up, then click approve and accept, and then wait for a representative to call you back with another approval code and verification of your date of birth, address, and your favorite zoo animal?

I JUST WANT TO PAY FOR THESE FLIP FLOPS!!!!!

"I'm sorry sir.  Please hold while we transfer you to the next available purchasing representative.  Your call is important to us and we will answer it in the order that is..." click.

Oh, screw it, I'll just go barefoot on my vacation.

It's a balance.  Enough layers of security and prevention without hindering performance.

But anyways check out the Swiss Cheese Model.  I think it's pretty neat stuff.
