9 tips for smart software engineering
This started off as a project postmortem. It quickly morphed into a caffeine-induced fever rant, supplemented with life software engineering tips picked up from various projects and readings. It was later refined, censored, and packaged for general consumption. Some of the original fever rant may have snuck through editing. Warning: These are the opinions and views of an Extreme Pragmatist and not necessarily the views of Workiva.
Tip #1 – Know your requirements.
Below are the often unwritten requirements of any project you’ll work on. Regardless of whether or not they are listed in a formal set of requirements or discussed during the initial discovery phase, they’re hard requirements. Don't ever kid yourself and think that the function you're providing is so important that it should not be subject to the following: performance, scale, resiliency, and usability.
Often during initial design discussions you will hear, "Let's get it working, iterate on it, and get it working better!" This is a typical agile approach and works very well for prototypes, but often even as a prototype, it doesn't add much value.
Imagine if a car manufacturer decided to build a high-end engine: Let's just get a basic car going, get a preliminary engine in there, just to get the wheels turning, we'll get the exhaust, fuel injection systems, brakes, tires...we'll put in the seats, get the windows in, get the paint job done, get a car dealership on board, get ready to ship the car, and then once we have that done...then we can go back and make the engine faster.
Guess what? We need a more aerodynamic body, the front-end needs to be longer, the exhaust and fuel injection system have to be replaced, the brakes need to be reworked, we need new tires, and since we have a ton of speed now, let's put a tacky paint job on it to match.
Granted, the car analogy isn't perfect. Since when has adding door locks slowed your car down to a crawl? (Cough) Permissions checking! (Cough) If you know enough about where you want to go, you can abstract away some of your performance bottlenecks into modules that can be worked on separately, as long as you know how to interface with them. That being said, you won't know where you want to go without some up-front design that keeps performance in mind.
It’s rare to get into any system design without scale coming up, and often, it’s written in the requirements exactly what the scaling expectations are. Often scaling requirements are even surpassed. Yay! However, it is uncommon that these scaling expectations or limits are communicated effectively for the next system component. Boo!
Interfaces between components rarely define or document scale since it’s assumed that user-generated traffic/content/whatever will never hit the expected maximum it supports.
Consider the following: Imagine the largest load of user generated content we've ever seen—let's call this terminal load. Now, “Verify Widgets” and “Save Widgets” were both designed to handle 2x terminal load. Verify Widgets verifies the user-generated content and throws it over to Save Widgets, which does some processing and saves it.
Verify Widgets was designed so well that it can actually do 5x terminal load. As an optimization, we've started processing content from multiple users concurrently. Yay! Speed improvements everywhere! Don't bother telling the owners of Save Widgets...everything seems to be working great.
Save Widgets handles this well at first, since it can actually handle 4x terminal load. However, at peak time (quarterly or annually), passing in multiple users' content concurrently, we start blowing Save Widgets out of the water.
Best case scenario: Save Widgets throws a hard error, and we have to release an emergency hotfix.
Worst case scenario: Save Widgets just keeps chugging along silently as overflow data is lost or only partially processed. The problem goes on quarter after quarter, compounding sporadic data loss issues.
In the worst case, by the time an engineer has time to focus on it and realizes the breadth of the issue, the amount of data lost is staggering. The engineer’s face goes pale, heart sinks and curses are uttered at the creators of Verify Widgets and Save Widgets. Who the hell is Brad MacLeod anyways?
It's not good enough to know your limit. Communicate your limit. Shout it from the rooftops. Include a test case that asserts an exception is thrown past a certain limit. Throw a hard exception at the API layer. Call out the limit in your docstring. Include a clearly defined batching API for consumers who will go over the limit, even if you don't believe it will ever happen.
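Here is a minimal Python sketch of what communicating a limit might look like. Everything in it is hypothetical and for illustration only: the limit of 500, the names `save_widgets`, `save_widgets_batched`, and `WidgetLimitExceeded`, and the placeholder where real processing would go. The point is the shape: the limit lives in the docstring, the API throws a hard error past it, a batching API exists for big callers, and a test pins the behavior down.

```python
from typing import Sequence

# Hypothetical limit, chosen only for this example.
MAX_WIDGETS_PER_CALL = 500


class WidgetLimitExceeded(ValueError):
    """Raised when a single call carries more widgets than it supports."""


def save_widgets(widgets: Sequence[str]) -> int:
    """Save one batch of widgets and return how many were saved.

    Limit: at most MAX_WIDGETS_PER_CALL widgets per call. Larger inputs
    raise WidgetLimitExceeded rather than being silently truncated.
    Callers with more data should use save_widgets_batched().
    """
    if len(widgets) > MAX_WIDGETS_PER_CALL:
        raise WidgetLimitExceeded(
            f"got {len(widgets)} widgets, limit is {MAX_WIDGETS_PER_CALL}; "
            "use save_widgets_batched() for larger inputs"
        )
    # Placeholder: real processing and persistence would happen here.
    return len(widgets)


def save_widgets_batched(widgets: Sequence[str]) -> int:
    """Save any number of widgets by splitting them into legal batches."""
    saved = 0
    for start in range(0, len(widgets), MAX_WIDGETS_PER_CALL):
        saved += save_widgets(widgets[start:start + MAX_WIDGETS_PER_CALL])
    return saved


def test_limit_is_enforced() -> None:
    """One widget past the limit must fail loudly."""
    try:
        save_widgets(["w"] * (MAX_WIDGETS_PER_CALL + 1))
    except WidgetLimitExceeded:
        return
    raise AssertionError("expected WidgetLimitExceeded")


if __name__ == "__main__":
    test_limit_is_enforced()
    print(save_widgets_batched(["w"] * 1201))
```

A caller who outgrows the limit finds out on the first oversized call, not a few quarters later.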
In terms of scale as a requirement, it’s not good enough to scale—you must communicate your limitations and reasons for those limitations.
Also beware of a skewed design that favors scale over performance. Often the very large scale cases represent less than 1 percent of usage. When too much emphasis is placed on scale, the common case may be overlooked. Done properly, you can have both scale and performance.
Software resiliency is often overlooked since it usually comes for free on a local system or dev environment. In a distributed environment, resiliency is a hard requirement. A 0.1 percent error rate may seem small, but in user-facing scenarios where each user may perform an action 1,000 times per week, a 0.1 percent error rate means every user hits an error about once a week, and that's bad for business.
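The arithmetic is worth spelling out. The sketch below uses the numbers from the paragraph above and assumes, for simplicity, that each action fails independently. The function names are made up for this example.

```python
def expected_failures(error_rate: float, actions: int) -> float:
    """Average number of failed actions over the period."""
    return error_rate * actions


def chance_of_any_failure(error_rate: float, actions: int) -> float:
    """Probability of at least one failure, assuming independent actions."""
    return 1 - (1 - error_rate) ** actions


ERROR_RATE = 0.001          # 0.1 percent
ACTIONS_PER_WEEK = 1000     # per user

if __name__ == "__main__":
    # About one failure per user per week on average.
    print(expected_failures(ERROR_RATE, ACTIONS_PER_WEEK))
    # Roughly a 63 percent chance that a given user sees an error this week.
    print(round(chance_of_any_failure(ERROR_RATE, ACTIONS_PER_WEEK), 2))
```

Under that independence assumption, nearly two out of three users run into at least one error in any given week.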
It's not good enough to say you have resilience. You need to prove it. Often building the framework to prove your system is resilient is more work than making your system resilient, but it is completely necessary to reach five-nines (99.999 percent) reliability.
Usability is absolutely necessary at three levels:
- Public facing APIs
- Internal facing APIs
Number 1 and number 2 might be optional, but number 3 is not. An internal facing API is an absolute hard requirement, and if you don't budget time for it up front, you’ll pay for it immediately after your project has completed. Internal support teams need a view into your component that may not be necessary for external consumers. Other developers need a clean and flexible API into your component, unless you want to support that component and all derivative works for the rest of your career.
Aside from the above, often unwritten requirements, there will be many others. Budget time for a discovery phase, and benchmark different approaches of the core problem before you even think about prototyping. Realistically, you won't capture every requirement up front, and you shouldn't expect to. Developers must be flexible and adaptable enough to deal with changing requirements, but there is a middle way. Do some research. Do not run full tilt into the night with your eyes closed. My friend broke his shin doing that.
Tip #2 – Throw it away.
In the work environment, nothing feels better than seeing the faces of stakeholders, with their jaws hanging open over a prototype you spiked out on your sick day. Leaving people in shock and awe over something you created yourself practically overnight is incredibly gratifying, and it’s easy to become attached to your creations.
If you don't understand the depth of the problem, certainly prototyping can help with that. That’s all it is—a learning tool. Remember, if it comes too easy, it's not worth keeping. It doesn't matter how much time has been devoted to it—as new requirements emerge it may become obvious that what you thought was a Ferrari is actually just a Honda. Maybe you can salvage the window wipers or the door handles, maybe not.
Don't waste time building a lemon. Don't get sentimental. Don't cling to mediocrity.
Only use high-end parts. If you know something has warts and you're not committed to fixing it, don't use it. If you must use it to get started, spend the time to write a wrapper around it, so you're not tightly coupled to it. Writing an interface in front of it will let you easily swap out to another system. Never let garbage juice from a spike leak into your core code.
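As a rough illustration of the wrapper idea, here is a small Python sketch. None of these names come from a real library: `WartyLibrary` is a stand-in for whatever component you are stuck with, and `DocumentStore`, `WartyStore`, `InMemoryStore`, and `archive_report` are invented for this example. Core code talks only to the interface, so the warts stay on one side of the wall.

```python
from typing import Dict, Optional, Protocol


class DocumentStore(Protocol):
    """The only storage interface core code is allowed to see."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class WartyLibrary:
    """Stand-in for a dependency with an awkward API (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def upsert_row(self, row_id: str, payload: bytes) -> None:
        self._rows[row_id] = payload

    def fetch_row(self, row_id: str) -> bytes:
        # Wart: raises KeyError on a miss and only speaks bytes.
        return self._rows[row_id]


class WartyStore:
    """Wrapper that hides WartyLibrary's quirks behind DocumentStore."""

    def __init__(self, library: WartyLibrary) -> None:
        self._library = library

    def put(self, key: str, value: str) -> None:
        self._library.upsert_row(key, value.encode("utf-8"))

    def get(self, key: str) -> Optional[str]:
        try:
            return self._library.fetch_row(key).decode("utf-8")
        except KeyError:
            return None


class InMemoryStore:
    """Drop-in replacement, showing the swap needs no core changes."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def archive_report(store: DocumentStore, name: str, body: str) -> Optional[str]:
    """Core code: depends on the interface, never on the library."""
    store.put(name, body)
    return store.get(name)


if __name__ == "__main__":
    print(archive_report(WartyStore(WartyLibrary()), "q3", "numbers"))
    print(archive_report(InMemoryStore(), "q3", "numbers"))
```

Swapping the backing system means writing one new class, not touching `archive_report`.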
Tip #3 – Know your platform.
Generally, most developers know their platforms fairly well, but it can take a special kind of complexity to shine a spotlight on some of the low-level bugs or limitations. There’s always going to be more to discover about whatever platform you're using, and as a clever developer, you're going to find ways around the limitations. If you can't, you're going to consider a new platform.
In conversing with most developers, it’s quite evident that they know their platforms thoroughly. When it comes down to reviewing code or running into bottlenecks, it becomes clear that what’s known is often forgotten during coding. This pain can be especially acute when your code is tightly coupled to your platform.
For example, the Google App Engine™ platform is great at some things and not so great at others:
- Serving web requests quickly: Great
- Long-running tasks: Not-so-great
- Scale up to handle massive amounts of traffic: Great
- Spin up time on very large modules: Not-so-great
- Queue for non-time-critical deferred processing: Great
- Time-critical low latency task processing: Not-so-great (Arguable, but it's still pretty amazing.)
Do not design with heavy dependence on weak areas of your platform. Again, abstraction is your friend when you need to swap out how something is run without changing your core functionality. Consider alternative implementations that take advantage of your platform's strengths, or consider another platform.
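One way to read "swap out how something is run" is sketched below in Python. This is not App Engine code. `TaskRunner`, `InlineRunner`, `DeferredRunner`, and `publish_report` are hypothetical names, and the deferred runner is just an in-process queue standing in for whatever deferred-processing facility your platform offers.

```python
from collections import deque
from typing import Callable, Deque, List, Protocol

Task = Callable[[], None]


class TaskRunner(Protocol):
    """How work gets executed; core code does not care which way."""

    def submit(self, task: Task) -> None: ...


class InlineRunner:
    """Runs the task immediately, inside the caller's request."""

    def submit(self, task: Task) -> None:
        task()


class DeferredRunner:
    """Queues the task so a worker can run it later."""

    def __init__(self) -> None:
        self._pending: Deque[Task] = deque()

    def submit(self, task: Task) -> None:
        self._pending.append(task)

    def drain(self) -> int:
        """Run everything queued so far and return how many tasks ran."""
        ran = 0
        while self._pending:
            self._pending.popleft()()
            ran += 1
        return ran


def publish_report(runner: TaskRunner, log: List[str]) -> None:
    """Core functionality: identical no matter how the task is run."""
    log.append("request accepted")
    runner.submit(lambda: log.append("report rendered"))


if __name__ == "__main__":
    inline_log: List[str] = []
    publish_report(InlineRunner(), inline_log)
    print(inline_log)

    deferred_log: List[str] = []
    worker = DeferredRunner()
    publish_report(worker, deferred_log)
    print(deferred_log)      # rendering has not happened yet
    worker.drain()
    print(deferred_log)
```

If the slow part turns out to sit in a weak area of the platform, only the runner changes.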
Tip #4 – Prove out your bottlenecks.
If any aspect of your system feels slow to you, it feels like an eternity to your users. You're looking at your system with rose-colored glasses because you love your system—it's your baby. If that's true, become the parent that all children hate. Hold them to impossible standards: Nothing is good enough.
"What'd you get?"
"I got an A."
"Let me see that report card. A 95? You're grounded!"
In software development, metrics are your system's report card. I cannot stress the importance of capturing metrics enough. If your kid is hiding his report card, you can assume the worst. There’s a sinking feeling you get when you realize the performance improvement you've been chasing for two weeks is just a micro-optimization. The only way to know the difference is through careful measurement and analysis. Data alone won't tell the full story. Analysis is key. Usage patterns must be fully understood to determine priority of optimizations, what’s worth going after, and what kind of testing tools are required. Optimize for your most common use case.
Be wary of smoothing. It's always nice to smooth out your averages by only looking at 5–95 percentile, but what are your min and max times telling you? How big is the variation there? Can you explain that variation? Why or why not?
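To see what smoothing can hide, consider this small Python sketch. The latency numbers are invented for illustration, and `trimmed` is a hypothetical helper that keeps only the samples between the 5th and 95th percentile positions.

```python
from statistics import mean
from typing import List, Sequence


def trimmed(samples: Sequence[float], lower_pct: int = 5,
            upper_pct: int = 95) -> List[float]:
    """Drop the samples below lower_pct and above upper_pct by rank."""
    ordered = sorted(samples)
    low = len(ordered) * lower_pct // 100
    high = len(ordered) * upper_pct // 100
    return ordered[low:high]


# Made-up data: 98 ordinary requests, one suspiciously fast, one awful.
latencies_ms = [100.0] * 98 + [5.0, 9000.0]

if __name__ == "__main__":
    print(mean(trimmed(latencies_ms)))   # 100.0, looks perfectly healthy
    print(mean(latencies_ms))            # 188.05
    print(min(latencies_ms))             # 5.0, a cache hit or a skipped step?
    print(max(latencies_ms))             # 9000.0, somebody waited nine seconds
```

The smoothed number says everything is fine. The min and max are where the questions are.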
This doesn't just apply to your system, it also applies to your project team. Debug the process. Why is the project slowed down? What are we doing wrong? Whose fault is this? What can we do? Without data, chasing team optimizations is often fruitless. Even if you implement a working optimization, how do you know whether it was a real performance improvement or just another micro-optimization?
Analysis must be backed by data. Decisions must be backed by analysis. Actions must be backed by evidence. This does not mean you should omit metrics from your review where the root cause is not fully understood. Bad performance metrics should have a light shone on them as early as possible. If you knew all the numbers and all the whys, it's likely because your project is already complete.
Tip #5 – If QA is your bottleneck, maybe you're the problem.
Keep track of how much goes back to rework. Insist that every time a ticket goes back to rework, it gets a comment on it. The reason should be clear, concise, and high-level.
"Reason for rework: Merge conflicts."
"Reason for rework: Logical merge conflict."
Back to rework for these reasons? Ask yourself:
- How many people are working in the same logical area?
- Is there a clear division of responsibilities?
- Are the developers assigned to work on this properly communicating and coordinating?
- Is the code too tightly coupled causing small changes to have a ripple effect?
- How long has this been sitting in QA?
"Reason for rework: Scope change."
Back to rework for this reason? Ask yourself:
- Was the scope properly defined to begin with?
- How long has this been sitting in QA?
"Reason for rework: Insufficient unit tests."
"Reason for rework: Insufficient developer testing." (Ooh, controversial.)
Back to rework for these reasons? Ask yourself:
- Is this a pattern with a particular developer?
- Is it reasonable to expect that the developer would have known to test for this?
- Is this QA some kind of half-human-half-machine, capable of anticipating every permutation of user interaction possible? If so, how do we clone this person?
When using “Reason for rework” on tickets, it’s important that the reasons are short, general, and fit into a category that takes little time or effort for a QA or an engineer to fill out. A reason for rework should never look like this:
"Reason for rework: The intersection of two sets, on line 283 does not meet the expected criteria because of a hack introduced by another team several years ago."
A reason for rework like that is put together to diffuse responsibility and hide the fact that there was insufficient developer testing. Own your failures—it gives you the power to resolve them.
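Short, categorical reasons also have a practical payoff: they can be counted. Below is a minimal Python sketch, assuming ticket comments are available as plain strings. The helper `tally_rework_reasons` and the sample comments are made up for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

PREFIX = "Reason for rework:"


def tally_rework_reasons(comments: Iterable[str]) -> List[Tuple[str, int]]:
    """Count rework reasons, most frequent first; other comments are ignored."""
    counts: Counter = Counter()
    for comment in comments:
        if comment.startswith(PREFIX):
            reason = comment[len(PREFIX):].strip().rstrip(".")
            counts[reason] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample_comments = [
        "Reason for rework: Merge conflicts.",
        "Reason for rework: Scope change.",
        "Looks good to me.",
        "Reason for rework: Insufficient developer testing.",
        "Reason for rework: Merge conflicts.",
    ]
    for reason, count in tally_rework_reasons(sample_comments):
        print(f"{count}  {reason}")
```

A long, one-off reason like the example above would land in a bucket of its own every time, which is one more argument for keeping them general.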
Do not subvert the process. Do not make deals with QA to sneak in a commit without pushing the ticket back to rework. As an engineer, own up to the cases where you've had insufficient developer testing. In the words of the late Tupac Shakur, "I ain't mad at cha." Reworks are bound to happen. This is an opportunity to grow and minimize their recurrence. Ask your QA how it was spotted, and commit the technique to memory, so you can do that testing or include a unit test for it before you send your next ticket over the wall.
If you're finding that the answer to "How long has this been sitting in QA?" keeps coming up "Too long," it's time to take action. First, identify how bad your QA backlog is and try to understand why. If tickets aren't being filled out with proper testing instructions, scripts where necessary, or tooling to assist QA, then it’s on developers to pull their weight. The length of time in QA is inversely proportional to the amount of time a developer has put into making his or her ticket testable.
Additionally, take a look at how your tickets are being broken up. When a ticket is tested by QA and a bug is discovered that’s only tangentially related, are you breaking that bug out into a separate ticket? How necessary is that? When you break a ticket up into several smaller tickets, you're multiplying the work QA has to do. At the same time, a ticket that runs over scope is more likely to go back for rework or become so large that it’s impossible to review. It’s a delicate balance, and it takes time to get it right.
In the very unlikely situation that QA is really to blame for the bottleneck, push your concerns up the stack. Do you need more resources? Do you need a QA with a particular skill set or expertise from another team? If you don't make your needs clear, you’re setting yourself up for failure in the future. Imagine this conversation between a product manager (PM) and a stakeholder (SH) at sprint review:
PM: "And, we did all our work, but QA is just taking a long time, and that's why we didn't hit our target this Sprint."
SH: "Really? Why are they backed up?"
PM: "I'm not sure, it just takes a really long time. It's QA’s fault."
SH: "Well, what did your QA resource say?"
PM: "She said she’d be caught up in two days. But that was a week ago."
SH: "What did she say when you followed up?"
PM: "Oh, I didn't follow up. It's her fault."
SH: "Did you bring it to your delivery manager? Did he bring it to their boss?"
PM: "I think so?"
This sounds so incredibly lame. Refuse to be that PM or team lead or whatever. It’s better to take responsibility for failure than point fingers or make excuses. Pointing fingers breeds contempt across groups and leads to dysfunctional silos. If you're starting to feel a rift between your team and another team, change your mindset: We’re all part of the same team, and we all have the same goal. Keep communication open, honest, and respectful.
Tip #6 – Swarm and scatter.
From what I can tell, the swarm strategy is a relatively new idea (or maybe just a new name for an old idea?) and is really about expediting the completion of an initiative by adding more developers.
Swarming using multiple teams to expedite work on geographically separate aspects of the code can work very well. The key term here is “geographically separate.” I don't mean geographically separate across the globe but rather just across the code base. It's nice, as a back-end developer, to be able to call in the front-end mercenaries when you need some code stood up quickly. Back-end and front-end developers iterating concurrently can really improve designs by making practical decisions about the division of responsibilities between client and server.
Very small groups can work within close code proximity if they have tight coordination. This, of course, leads to a higher communication overhead between developers, which means less time to code.
A scatter pattern is like a shotgun blast—two or three BBs in the same tight area, but the bulk of the blast hitting a wider general area. Those small groups are laser-focused on delivering and try not to overstep their boundaries. They do this by creating new tickets to address issues in areas that are not part of their laser-focus and assigning them to another small group.
As a developer with strong moral fiber, you may feel conflicted about not touching tech-debt you come across, and instead creating a new ticket to address it. You will feel vindicated later when you see that another team has addressed it, and you avoided a merge conflict.
Tip #7 – Cancel your review.
Preparing for a review can side-track progress for up to three days out of a two-week sprint. This is especially painful when you have committed to hitting a specific target and haven't factored in the laborious task of gathering the metrics required to prove it.
If you need more time to put your sprint together or you really don't have anything important to say, provide the bare minimum updates to your stakeholders, and push the review back. Don't put your team through excessive stress just because you have a recurring meeting on your calendar that says you're going to show slides.
Tip #8 – Be your harshest critic.
"The end-user experience might suck, but the product is better!" You're right—the end-user experience does suck. And you're wrong—it is NOT better.
If you have any feelings or doubts in your mind whatsoever about the quality of your feature/product/system, then you've got a problem. If you don't feel like you're getting what you need from a team, talk to them every day until you are.
Bring in QA. Bring in UX. Trial it with other internal users. Get some beta customers on board. Invite senior members from other teams to code reviews. You'll find at least one voice who doesn't suffer from Groupthink. Escalate their concerns, and deal with it NOW.
Tip #9 – Fear nothing.
"Here be dragons!"
"Have you seen that function? It is terrifying!"
"That whole module is black magic!"
Language is culture. Language that perpetuates fear and mysticism promotes dark-age thinking and can cause developers to become complacent. When younger developers hear older, more experienced programmers mention something is magic, they may give up trying to learn it. When most non-developers see us coding, they think what we're doing is black magic. We're developers—we don't believe in magic.
More appropriate, and inspiring, responses to (ahem) non-conventional code might be:
"It seems overly complex," but it might not be.
"It's ...interesting," and I need to get to the bottom of it.
"Needs a serious refactor," or maybe I need to learn more about the design pattern being used.
Don't be afraid to sink your teeth into the unknown. Rip code apart, put it back together, spike a new feature into it, turn a library on its head, write a new library that's faster than the built-in, take chances, make mistakes, get messy, cross-train with rapid response, take sky-diving lessons, commit to a timeline, ruthlessly execute your objectives—fear nothing.
Google App Engine is a trademark of Google Inc.
About the Author
MacLeod Broad is beyond a job title. He works on software system designs, cross-team initiatives, and writes code as often as time permits.