Entries tagged as sysadmin
Wednesday, April 27. 2011
Amazon haven’t been having a good time of it with their AWS cloud offering; a solid week of outages across multiple availability zones has knocked out a number of high-profile clients and gotten them a bunch of bad publicity.
The root cause of the problem, from what we’ve learned thus far, appears to lie in Amazon’s EBS layer, the storage subsystem that backs their cloud services. The fact that as much as 0.07% of that storage is now reported to be so badly damaged as to be irretrievable will be the piss icing on the shit cake that the outage has already been for any number of large and small companies hosted there. It’s tempting to throw rocks at Amazon, and wearing the hat of a systems engineer who’s been hearing “push to the cloud” offered up as a solution to the problem of, well, “giving engineers jobs” for some time, there’s a certain amount of schadenfreude involved.
But what it should serve as is a timely reminder that distributed persistent storage is probably one of the biggest challenges in software engineering; unlike programming challenges, where success is often a matter of choosing better algorithms to produce better results, it’s fundamentally a problem of physics, of ‘c’, the speed of light.
The round trip for a packet from my datacentre in Auckland to my datacentre in Wellington is around 16 ms with a reasonably simple network topology over uncongested links. I know; I’ve measured it, and it’s barely above the latency built in to the speed of light; even perfect zero-overhead networking would have shaved only a couple of milliseconds off. So replicating data synchronously adds 16 ms to every storage transaction, whether that’s at the high level (a COMMIT in a database) or a low level (shifting a block across the network).
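To put a number on that, here’s the back-of-the-envelope version as a quick sketch; the 16 ms is the figure measured above, while the local commit cost is a made-up assumption for illustration:

    # What a 16 ms replication round trip does to a strictly serialised
    # writer that must wait for the remote ack before its next commit.
    rtt_ms = 16.0          # measured Auckland<->Wellington round trip
    local_commit_ms = 1.0  # assumed local fsync/commit cost

    max_tps = 1000.0 / (local_commit_ms + rtt_ms)
    print("max serialised commits/sec: %.0f" % max_tps)  # ~59

Roughly sixty commits a second, no matter how fast the servers at either end are.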
For certain types of data, that’s not really a problem, but for interactive applications that may wish to store many transactions per second, it’s fatal. At this point you discover the harsh mistress that is queuing theory, and you start facing difficult choices: do you simply limit peak performance (run as fast as you can shunt packets over a few hundred kilometres), accept data loss (replicate asynchronously), or give up resilience (only replicate as far as latency will let you get decent performance)?
There are mitigations and clever engineering tricks to work around the worst aspects of these problems; you can juggle the size of your transaction units, you can mix synchronous and asynchronous replication, you can do any number of other things, but it takes good engineering to get close to the limits of physics, and at the end of the day you can’t get around them. That good engineering is why Netflix, for example, survived the outage unscathed when many others didn’t.
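One of those tricks in miniature: batch transactions so that many commits share a single round trip. The batch sizes and costs below are made-up knobs, purely for illustration:

    # Amortising the round trip across a batch of commits, using the same
    # assumed numbers as the earlier sketch; only the batching varies.
    rtt_ms = 16.0
    local_commit_ms = 1.0

    for batch in (1, 10, 100):
        tps = batch * 1000.0 / (batch * local_commit_ms + rtt_ms)
        print("batch=%3d: ~%.0f commits/sec" % (batch, tps))
    # batch=  1: ~59, batch= 10: ~385, batch=100: ~862

The catch, of course, is that the bigger the batch in flight, the more data you stand to lose if the link or the far end dies mid-batch; which is exactly the trade-off described above.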
All of which is essentially the preamble to a point, at which I have (finally!) arrived: much of the discussion around “migrating to the cloud” is essentially wishful thinking, and it’s interesting to me how much of it comes from people who don’t do systems engineering, but see managing systems as an annoying problem they’d like to reduce to a black box that’s someone else’s issue. “Going to the cloud” is seen in the context of allowing a company, large or small, to eliminate a whole set of expensive skills from the org chart, with cheaper infrastructure a mere side benefit. This is throwing out the baby with the bathwater. One wonders how many of the afflicted will realise that.
Wednesday, January 26. 2011
It seems like sooner or later everyone ends up writing about interviewing. “My favourite questions” is one of those topics that pops up again and again, but I am so often bemused by what those questions turn out to be.
Take a recent StackOverflow query: “What are your favourite questions for a senior Unix admin?” The responses elicited were almost entirely of the “swallow the man pages”, “trick questions I know”, or, worse yet, “simple stuff I can Google” variety.
Folks, whether or not I remember how to use ldd or strace, or remember that it’s truss on Solaris, not strace, does not make me a senior SA, and if you think it makes you one, I pray for the sake of your ego that I never interview you. Those are questions I’d expect a junior to either know already, or be able to answer given a few minutes with the search engine of their choice. Even in the dim dark days before ubiquitous workplace Internet access and decent search engines, one could fairly quickly find the answers in a wall of documentation or, if all else failed, by telling a colleague you’d forgotten and letting them tell you.
What does a senior SA look like to me? A senior SA is someone who solves problems. S/he understands that a senior’s role is to grasp the relationship between complex application layers and the systems s/he’s responsible for, so that when someone rocks up and says, “This critical application that makes us lots of money isn’t working, and it appears to be a slowdown with this other application it depends on running slow, but we looked at the virtualized server and we can’t work out why it’s going slow”, s/he can get on with pinpointing the fault, not throw up their hands and refuse to look at it until the app and network guys prove it isn’t their problem. That’s why I ask questions that are app-centric as well as server-centric. Can a candidate trace down the stack from high-level symptoms to nail the problem? Protip: if your answer at this stage is “not my problem”, we’re done.
A senior SA can sit down with a problem they’ve never encountered before, and even if they can’t solve it, they can demonstrate that they understand the principles of problem-solving, that they know where to look, that they’ll look to the most likely indicators based on the evidence they’ve unearthed. They’ll follow a logical train of thought, and they’ll investigate things in a controlled fashion. And if all else fails, they’ll call for help and beat their head against the problem until help solves it, or they do. That’s why I ask questions that start with a simple problem statement from real-world issues that have paged me at 2 am, and see if the person in front of me attacks it the right way. The result isn’t really relevant—they don’t know my environment in detail, after all—but the methodology is priceless.
(You get bonus points if you mention, “I’d check the change logs to see what changed today” as a starting point.)
Problem solving isn’t just at the break-fix level. A senior SA should be able to plan, design, think ahead, and solve the bigger problems. I want to know how you work out your patching strategy, how you balance critical security fixes with regular cyclical upgrades and the desire for stable systems, how you’re going to work in with project deliverables. I want to hear your views on the best way to manage authentication. Tell me how you like to automate server deployments, how you deal with guest sprawl in virtualized environments. What information is key to your capacity planning? And virtualisation? How do you feel about it, anyway? What problems does it solve, and what does it create? I don’t care if you don’t know all the answers to these questions; hell, I don’t. Can you ask me good questions when you don’t know the answers? Can you, in short, talk to me about them intelligently?
I do not fucking care if you remember how to delete a file called ‘-rf’ from the root filesystem off the top of your head.
Tuesday, February 2. 2010
When you have more than a few Unix people together, you will end up with vehement, and possibly violent, disagreement over the Right Way To Do Things. This generally starts with Unix flavour, then distribution (if relevant to the flavour), and wends its way through the countryside of editor wars, MUA disagreements, MTA squabbles, and, of course, mailspool layouts.
Once upon a time, I had a mail server used by people who would ssh into it and run the MUA of their choice, in text mode, as God intended. But then Complications arose. More people began to use it, people who did not like console MUAs. Furthermore, some of the users found themselves trapped behind corporate firewalls whose keepers frowned on tunnelling ssh through them.
So some way of providing access for GUI clients, webmail, and console clients, all sharing the same mail spool, was needed.
So uw-imap was installed, and, well, it only sucked a bit; and squirrelmail was installed, and it didn’t suck much. uw-imapd even supported SSL.
But time passed, and mailspools grew, and soon the users complained. Webmail was becoming unusable, unless you like spending minutes waiting for 20 mails to be rendered in a browser. uw-imap was starting to suck. A lot. It was time to find something that sucked less.
Continue reading "Dovecot"
Monday, February 1. 2010
So yum stopped working on a couple of servers registered to a Satellite server, with what appeared to be a login problem:
$ yum install osad
  File "/usr/share/rhn/up2date_client/up2dateAuth.py", line 217, in getLoginInfo
  File "/usr/share/rhn/up2date_client/up2dateAuth.py", line 165, in login
  File "/usr/share/rhn/up2date_client/up2dateAuth.py", line 121, in readCachedLogin
    data = pickle.load(pcklAuth)
  File "/usr/lib64/python2.4/pickle.py", line 1390, in load
  File "/usr/lib64/python2.4/pickle.py", line 872, in load
A yum clean all, the usual remedy for this sort of thing, achieves nothing. The server is checking in with the Satellite server, as far as Satellite is concerned, but yum fails on pretty much any operation. Here’s the weird solution: queueing an update in the Satellite server for that box, and then running
…pulls down the update, and then yum starts working again.
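For what it’s worth, the traceback dies inside pickle.load, which smells like a corrupt or truncated cached-login blob. A minimal sketch of that failure mode, assuming nothing about the real RHN client beyond what the traceback shows (the file name here is invented):

    import pickle

    # Write a deliberately truncated pickle, standing in for a damaged
    # cached-login file; the real client's cache path and format differ.
    with open("cachedlogin.pkl", "wb") as f:
        f.write(pickle.dumps({"token": "abc123"})[:-5])

    with open("cachedlogin.pkl", "rb") as f:
        try:
            data = pickle.load(f)
        except Exception as exc:  # EOFError or UnpicklingError, typically
            print("cache unreadable: %r" % (exc,))

Presumably the queued update forces the client to re-authenticate and rewrite that cache, which would explain why it unwedges yum.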