It is quite common to see software that does not scale well.
Software that works well on a developer’s machine suddenly starts crawling when it is fed real data.
I have seen it in my teams, and it usually escalates to a critical emergency: all other development stops until a solution is found, if one is found at all.
Use real-size test data for development
Development is usually performed on ad-hoc databases, with a few items created just for basic testing.
While this may be useful for correctness tests, it will hardly ever reveal performance issues. OK, you may anticipate them and add indexes to the database at will, but there is no general recipe for performance, and without a good data set such issues will appear sooner or later.
Only testing with real-size data brings performance problems to light.
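As a rough illustration, here is a minimal sketch that fills a database with a larger, synthetic data set and asks the engine how it plans to run a lookup, before and after adding an index. It assumes SQLite through Python's standard `sqlite3` module; the `orders` table, its columns, and the index name `idx_orders_customer` are hypothetical and not taken from any real project.

```python
import sqlite3


def build_orders_db(row_count):
    # Hypothetical schema: an in-memory "orders" table filled with synthetic rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    rows = ((i % 1000, float(i)) for i in range(row_count))
    conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_plan(conn):
    # Ask SQLite how it would run the lookup; the last column is the readable detail.
    cursor = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
    )
    return " | ".join(row[3] for row in cursor.fetchall())


if __name__ == "__main__":
    conn = build_orders_db(100_000)
    print("before index:", query_plan(conn))
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    print("after index: ", query_plan(conn))
```

The plan only tells you how the query is executed. Whether it is fast enough still has to be measured on a data set of realistic size.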
Note that real data also means real use cases, each with a target performance: good performance is only useful as long as the customer appreciates it; otherwise it is a worthless effort.
Therefore, once equipped with real data sets and scenarios, start profiling. Measure the performance, and make a plan. Where is the bottleneck? Is it possible to optimize that part? What is the current gap in comparison to the target?
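A profiling session could look like the following minimal sketch, which uses Python's standard `cProfile` and `pstats` modules. The functions `run_scenario` and `slow_lookup` are hypothetical stand-ins for a real use case and its hot spot; in practice you would profile your own scenario on your own data.

```python
import cProfile
import io
import pstats


def slow_lookup(items, wanted):
    # Hypothetical hot spot: a linear scan of a list for every wanted value.
    return [w for w in wanted if w in items]


def run_scenario(size):
    # Hypothetical stand-in for a realistic use case on real-size data.
    items = list(range(size))
    wanted = list(range(0, size, 2))
    return slow_lookup(items, wanted)


def profile_scenario(size):
    # Profile one run and return its result plus a textual report.
    profiler = cProfile.Profile()
    profiler.enable()
    result = run_scenario(size)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_scenario(2000)
    print(report)
```

The report lists where the time goes, which is the starting point for answering the questions above: the bottleneck is whatever dominates the cumulative time, not whatever you suspected beforehand.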
Assess the complexity of every class and method
While the former approach is correct, I think we should also proactively take a small step back to our programming courses, where we were taught about algorithm correctness and complexity, and how to assess them. Yep, this is a kind of lost knowledge: with the notable exception of the STL, it is quite uncommon to see such analysis in code or in its documentation.
This should not be an obstacle though, and a very useful tool for tracking complexity is the so-called “Big-O notation”.
Wikipedia has quite a good explanation, but roughly speaking, this notation gives an idea of an algorithm’s complexity as a function of the size of the input data. For instance, a linear algorithm will take about twice the time to complete on twice the data, while a quadratic one will take about four times as long, and a constant one will return a result in constant time, no matter the size or the kind of the input data.
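The doubling rule can be checked with a small experiment. The sketch below counts elementary steps instead of measuring wall-clock time, so the result is deterministic; all function names are made up for illustration.

```python
def constant_steps(data):
    # Looks at a single element, regardless of the input size.
    return 1 if data else 0


def linear_steps(data):
    # Visits every element once.
    steps = 0
    for _ in data:
        steps += 1
    return steps


def quadratic_steps(data):
    # Visits every pair of elements.
    steps = 0
    for _ in data:
        for _ in data:
            steps += 1
    return steps


def growth_ratio(count_steps, size):
    # How much more work is needed when the input size doubles?
    small = count_steps(list(range(size)))
    large = count_steps(list(range(2 * size)))
    return large / small


if __name__ == "__main__":
    for fn in (constant_steps, linear_steps, quadratic_steps):
        print(fn.__name__, growth_ratio(fn, 100))
```

Doubling the input leaves the constant case unchanged (ratio 1), doubles the linear one (ratio 2), and quadruples the quadratic one (ratio 4).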
It is an invaluable tool for assessing algorithm complexity; unfortunately, it is not always easy to use: a good analysis may involve summing series and careful reasoning about the statistical distribution of the data. In many cases, though, even a rough, pessimistic estimate is already a good starting point.
Of the various complexity classes (again, take a look at the Wikipedia article for a complete listing), the most important are (from “faster” to “slower”):
- Constant
- Logarithmic
- Linear
- Quadratic
- Polynomial
- Exponential
Generally speaking, you should try to avoid anything more complex than linear. Of course, this is not always possible, but then you should beware of large data sets.
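As an example of what removing a quadratic step can look like, here is a minimal sketch of a duplicate check written twice. The function names are hypothetical, and the second version assumes the values are hashable, so that set lookups take constant time on average.

```python
def has_duplicates_quadratic(values):
    # Compares every pair of elements: quadratic in len(values).
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_linear(values):
    # Remembers what was already seen: linear on average,
    # assuming hashable values, at the cost of extra memory for the set.
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False


if __name__ == "__main__":
    sample = list(range(1000)) + [500]
    print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))
```

Both functions give the same answer; on a handful of items the difference is invisible, and it only starts to matter with real-size data.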
Be humble
When your application is slow, don’t blame the operating system, the language, or third-party libraries.
Don’t even say “it is a heavy calculation, hence it must be slow”.
You can say whatever you want, but your application will still be slow.
Instead, assess performance, ask for a target, and start profiling. Analyze and then try to remove unnecessary complexities.
Be humble, and fix your errors.