Let’s assume your app is currently in production, and has a non-trivial number of users. By non-trivial I mean a number that makes it impractical for you to write a personalized apology email to each and every single one of them when you lose their data. When you reach that sort of penetration, every time your developers touch the UI or anything directly adjacent to the UI it is bound to break someone’s workflow.
You might think you are fixing a long-standing UI bug, or making the user interface more consistent and therefore more user friendly, but it does not matter. At least one of your users has probably worked the side-effects of said bug into the way they do things, and it will appear broken to them afterwards.
Let me give you a few examples from personal experience. This is not a project I am personally involved in on the development side. For once I am actually sitting at the user-end and watching the fireworks. Let me set the stage for you: we have been using a third-party time tracking tool for ages now. When we first deployed it, it was a self-hosted application that we had to maintain ourselves. This involved periodically rebooting the server due to memory leaks, and applying the infrequent patches and upgrades. Prior to every upgrade we would first test it on a dummy instance, and would have the folks who used the tool extensively to do scheduling and to process time and expenses give it a once-over before we deployed it to production. If there were issues we would work with the vendor to iron them out prior to the deployment. It worked well.
Unfortunately about a year ago they discontinued support and licensing for the self-hosted version and we had to upgrade to their “state of the art” cloud based service. This was nice for me because it meant we no longer had to expend time and resources to maintain the tool internally. The end users were also happy because they would be getting all kinds of new bells and whistles to play with. The vendor promised the cloud version is developed and improved very aggressively based on user suggestions and that their new agile development process can deploy fixes and custom patches much faster than before. It sounded great on paper, but it turned out to be a disaster.
The vendor likes to push out minor updates and patches every other Monday, and like clockwork this results in our ticketing system getting clogged up with timesheet software related issues. We verify all of these, then tag-team grouping and compiling them into support requests, which get forwarded to the vendor support team and cc’d to our account manager. This is our third account manager since the switch and I suspect our company single-handedly got the last two fired by maintaining an unstoppable barrage of open tickets and constant demands for discounts and downtime compensation.
Most of the problems we are having stem from trivial “fixes” that make perfect sense if you are on the development team. For example, recently someone noticed that the box you use to specify how many hours you worked can accept negative values. There was no validation so the system wouldn’t even blink if you entered say negative five hours on a Monday. So they went in, added input validation, and just to be on the safe side they fixed it in their database. And by fixed, I mean they took the absolute value of the relevant column, and then they changed the datatype to unsigned integer. Because if there were negative values there, they had to be in by mistake, right? Because who in their right mind would use negative time? Well, it turns out it was my team. Somehow they figured out a way to use this bug to easily fudge time balances on the admin site. For example, if someone was supposed to work five hours on a Monday, but had an emergency and left three hours early, the admin would just go in and add -3 work hours to the timesheet with a comment. It allowed them to have both the record of what the person was supposed to do, and what actually happened. Needless to say, after the “fix” all our reports were wrong.
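To make the damage concrete, here is a minimal sketch of what that kind of migration does to a balance. I have no idea what the vendor's schema actually looks like, so the row layout and the function names below are made up for illustration; only the "take the absolute value of the hours column" step comes from what they told us.

```python
# Hypothetical timesheet rows: (employee, day, hours, comment).
# The real vendor schema is unknown; this layout is an assumption.
entries = [
    ("alice", "Mon", 5, "scheduled shift"),
    ("alice", "Mon", -3, "left early, emergency"),
]


def total_hours(rows):
    """Sum the hours column, the way a simple balance report might."""
    return sum(hours for _, _, hours, _ in rows)


def apply_abs_migration(rows):
    """Mimic the described 'fix': replace every hours value with its absolute value."""
    return [(emp, day, abs(hours), note) for emp, day, hours, note in rows]


before = total_hours(entries)                       # 5 + (-3)
after = total_hours(apply_abs_migration(entries))   # 5 + 3

print(f"balance before migration: {before} hours")
print(f"balance after migration:  {after} hours")
```

The correction row that was meant to subtract three hours now adds three, so the five-hour shift with an early exit is reported as eight hours instead of two.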

Ok, so you divide by zero here, put NULL in all these fields, put -1 here, and 1=1 in the date field, click through the segfault message and your report is ready.
More recently, they noticed that there were two ways for people to request time off in the system. You could create a time-off request ahead of time (which had to be approved by a supervisor) or you could submit it as part of your timesheet by putting in 8 hours as “personal day” or whatever. Someone on the vendor’s dev team decided to “streamline” the process and removed the ability to enter time off from the timesheet page. To them it made perfect sense to only have a single pathway for entering time off. Unfortunately my team relied on that functionality. We had a special use case for the hourly contractors which simply required them to record their downtime as “unpaid leave” (don’t ask me why – I did not come up with that). Before, they could do that by simply filling out their timesheet. After the upgrade they had to go to the time off tab, and fill out a time-off request for every partial day that week, then have that request approved by a supervisor before they could actually submit a timesheet. So their workflow went from clicking on a box and typing in a few numbers to going through 3-5 multi-stage dialog boxes and then waiting for an approval.
To the vendor’s credit, they are addressing most of these problems in a timely manner, and their rapid development cycle means we don’t have to wait long for the patches. They do however have serious issues with feature creep and each “fix” creates three new problems on average.
The majority of these stem from the fact that our users are not using the software the way the developers intended. They are using the application wrong… But whose fault is that? Should paying customers be punished or even chastised for becoming power users and employing the software in new, emergent ways rather than using it as you imagined they would? Every botched, incomplete or ill-conceived UI element or behavior in your software is either an exploit or a power user “feature” in waiting.
I guess the point I’m trying to make is that once you deploy your software into production, and make it available to a non-trivial number of users, it is no longer yours. From that point on, any “bug fix” can and will affect entire teams of people who rely on it. A shitty feature you’ve been campaigning to remove is probably someone’s favorite thing about your software. A forgotten validation rule is probably some team’s “productivity crutch” and they are hopeless without it.
Full test coverage may help to limit the number of “holes” your users may creatively take advantage of, but it only takes you so far. There is no way to automate testing for something you never anticipated users doing. You won’t even discover these emergent, colorful “power user tricks” by dogfooding your app, because your team will use it as intended, rather than randomly flail around until they find a sequence of bugs that triggers an interesting side-effect and then make it the core of their workflow. This is something you can only find out if you work with genuine end users who treat your software like a magical, sentient black box that they are a little scared of.
This reminds me of a project that people I’ve worked with at my university were working on:
Ali Baba is a Java applet from 2005 that displays a graph of genes, proteins, diseases and medications that are mentioned together in medical papers, and is fairly bad at it. It was made for medical research (and people needed to show off what an amazing database of text-mined medical papers we have; it really is quite cool). Now you ask yourself why my university still supports a Java applet. It is not because there is some old prof sitting there thinking this is the latest shit. On the contrary, this specific work group consists of relatively young and very talented people. The reason it’s still up is that every time they take it down they get a few dozen emails complaining about the discontinuation of the authors’ favourite mind mapping tool.
I like your definition for “non-trivial number of users.” That really does sound like a perfect metric. If you have 10,000 users and you make a minor breaking change that requires on average 15 minutes for each user to re-learn the user interface, or whatever it takes to deal with the change, that’s a cost of over 100 man-days for that modification. It adds up!
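For anyone who wants to plug in their own numbers, here is a quick sketch of that arithmetic. The function name is made up, and `hours_per_day` is a parameter because the answer depends heavily on what you count as a "day".

```python
def relearning_cost_days(users, minutes_per_user, hours_per_day):
    """Total time lost across all users, expressed in days of the given length."""
    total_hours = users * minutes_per_user / 60
    return total_hours / hours_per_day


# 10,000 users losing 15 minutes each is 2,500 hours in total.
round_the_clock = relearning_cost_days(10_000, 15, hours_per_day=24)
working_days = relearning_cost_days(10_000, 15, hours_per_day=8)

print(f"{round_the_clock:.1f} days of 24 hours")  # about 104.2
print(f"{working_days:.1f} days of 8 hours")      # 312.5
```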
Your problem description can be generalized to all interfaces, human or otherwise. That’s fundamentally why we have dotted version numbers. If version numbers were merely about features and bugs, all we would need is a single integer: “Foobar Pro only has feature X in versions 5 and above.” (I guess this is the trend that browsers follow now.) The major.minor.patch convention primarily exists to describe interface compatibility between versions. Patch increments are backwards-compatible bug fixes. Minor increments are backwards-compatible changes that might introduce new features. Major increments are breaking changes. This is mostly about APIs, though. I don’t think user interfaces are often considered when versioning, but maybe they should be.
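As a rough illustration of that convention, here is a minimal sketch that classifies the jump between two versions. It assumes plain three-part numeric versions like `1.4.2` (no pre-release or build suffixes), and the function names are made up for this example rather than taken from any real library.

```python
def parse(version):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old, new):
    """Name the most significant component that differs between two versions."""
    o, n = parse(old), parse(new)
    if n[0] != o[0]:
        return "major"  # breaking interface change
    if n[1] != o[1]:
        return "minor"  # backwards-compatible, may add features
    if n[2] != o[2]:
        return "patch"  # backwards-compatible bug fix
    return "none"


def is_compatible_upgrade(old, new):
    """True when `new` keeps the same major version and is not older than `old`."""
    o, n = parse(old), parse(new)
    return n[0] == o[0] and n >= o


print(classify_bump("1.4.2", "1.4.3"))          # patch
print(classify_bump("1.4.2", "1.5.0"))          # minor
print(classify_bump("1.4.2", "2.0.0"))          # major
print(is_compatible_upgrade("1.4.2", "2.0.0"))  # False
```

By this reading, the timesheet vendor's "fixes" would each have been a major bump, which is roughly the point: a user interface has a compatibility contract too, even if nobody writes it down.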
We’ve run into this problem a lot recently in the Emacs community with the relatively new package.el. There’s MELPA providing bleeding-edge packages, and these packages sometimes change their interfaces without warning. It’s annoying for me because sometimes it means I have to stop what I’m doing to update my own usage of the package, such as changing which of its functions I call or how I call them. To start working around this, a stable version of MELPA was introduced for hosting stable versions of packages — a repository where changes to interfaces are more conservative. This is sort of like Debian, with its stable, testing, and unstable repositories.
This mirrors some information I gleaned about Microsoft’s plans for Azure – the online version of SQL Server. They are planning to apply regular patches and fixes directly to the live cloud environment. So, anyone with an application that uses an Azure database could find its legs kicked out from under it by a patch change. They won’t have a test environment; it’s “bang” and the change is made. Of course none of this will concern managers who will be long gone before the chickens come home to roost!
Chris Wellons wrote:
Ouch, where I work we’re expected to put in an eight hour day on average. Your place expects you to work every waking moment? ;)
For reference (and this isn’t to be pedantic, but to show how you’re actually underselling your point) a “man-day” is usually defined as a work day, not a 24-hour day. Assuming that mythical eight-hour day, your example actually costs over a man-year. Ouch.
@ Jason *StDoodle* Wood:
Just recently I had an issue like that. I’ve been using a certain iOS app for tracking my finances for 3 years now. The new update changed the routine: before, each transaction was entered with a default date of today. Now each new transaction is added with the date of the previous transaction if that one happened during a previous hour. So if I am going through my bills and enter a bill from a week ago and then enter a bill from today, I have to change the date both times and also check the date all the time.
I wrote a comment about that on the developer’s site and got this email in response: