2020-02-01 - The Elements of Programming Style (1974)
(original .ipynb)
The Elements of Programming Style was written by Brian Kernighan and P.J. Plauger and I've got a lot of respect for both of them. In addition to writing some of the most important software, Kernighan also wrote some books I really enjoyed - The C Programming Language, The AWK Programming Language and The UNIX Programming Environment. Plauger's company "Dinkumware Inc" implemented a C standard library back in the '80s which I worked with a bit back when I was in Analog Devices. I hadn't really heard anything about the book but I'm curious what was considered good practice back in 1974, and how relevant it is for developers in 2019.
Chapter 1 - Introduction
In typical Kernighan fashion, we open with a snippet of code:
DO 14 I=1,N
DO 14 J=1,N
14 V(I,J)=(I/J)*(J/I)
It seems I'll have to read a fair amount of Fortran in this book :-) Anyway this is basically performing something like the below:
for i in range(n):
for j in range(n):
v[i][j] = (i/j) * (j/i)
Which is an obscure way to generate an identity matrix, and one which is a little bit wasteful (two divisions and one multiplication per element). He gives a "better" approach in Fortran (which I'll ignore as it is inscrutable to my non-Fortran-understanding eyes) and in PL/I, which is below:
/* MAKE V AN IDENTITY MATRIX */
V = 0.0;
DO I = 1 TO N;
V(I,I) = 1.0;
END
So the first lesson is pretty clear, and is spelled out explicitly:
Write clearly - don't be too clever.
There's a bit more introducing the various chapters of the book, a couple of examples (another Fortran one - refactoring some code that did a logical test which just skipped over a GOTO
instruction) and a quick explanation that it doesn't matter what language you're using, "although details vary from language to language, the principles of style are the same" - stuff like branching around branches is gonna be confusing or weird in any language.
There are also a dodgy sort and a couple of questions under the title "Points To Ponder" so it seems like a good way to finish each chapter would be to work through these (update: no it is absolutely not, I will not be doing this):
Points To Ponder
1.1 A matrix with n
rows/columns has n^2
elements, initializing it takes n^2
assignments. Multiplying matrices or solving n
linear equations in n
unknowns involves order n^3
operations. Give arguments to support both:
* If n >= 10
the time required to initialize a matrix is not very important
* The time taken to initialize the matrix will pale in comparison to the time taken to perform the computation
* If n < 10
the time required to initialize a matrix is not very important
* The initialization time will be a proportionally larger, but given the small problem size it's not particularly large overall
1.2 Ok the question is basically which version of the two functions below do I prefer:
C Original version
100 READ(5, 110) X
110 FORMAT(F10.0)
IF (X .LT. 0.0) WRITE(6, 120) X
120 FORMAT(1X, 'SQRT(', qPE12.4, ') UNDEFINED')
IF (X .EQ. 0.0) WRITE(6, 130) X, X
130 FORMAT(1X, 'SQRT(', 1PE12.4, ') = ', 1PE12.4)
IF (X .LE. 0.0) GOTO 100
B = X/2.0
200 IF (ABS(X/B - B) .LT. 1.0E-5 * B) GOTO 300
B = (X/B + B) / 2.0
GOTO 200
300 WRITE(6,130) X,B
GOTO 100
END
C Fixed version
10 READ(5, 11) X
11 FORMAT (F10.0)
IF (X .GE. 0.0) GOTO 20
WRITE(6,13) X
13 FORMAT (' SQRT(', 1PE12.5, ') UNDEFINED')
GOTO 10
20 IF (X .GT. 0.0) GOTO 30
B = 0.0
GOTO 50
30 B = 1.0
40 A = B
B = (X/A + A)/2.0
IF (ABS((X/B)/B - 1.0) .GE. 1.0E-5) GOTO 40
50 WRITE(6, 51) X, B
51 FORMAT(' SQRT(', 1PE12.5, ') = ', 1PE12.5)
GOTO 10
END
OK so I really don't like reading Fortran but the line labels in #2 are a bit more sensible, in #1 there's three different comparisons between X and 0 (first <, then ==, then <=) which make it harder to read than #1
1.3 In a square-root routine earlier there's some performed to decide when the newton-raphson approximation is "good enough" - so when computing y = sqrt(x)
it checks if (y*y) - x
is less than the value 1.0e-5
. A slightly different method is suggested that tests against a threshold relative to the size of input:
REAL FUNCTION RELDIF(X,Y)
RELDIF = ABS(X-Y) / AMAX1(ABS(X),ABS(Y))
RETURN
END
The question is - how can it be used in our example implementation and are there any values that could cause problems if this RELDIF
is used?
Instead of the IF (ABS((X/B)/B - 1.0) .GE. 1.0E-5) GOTO DONE
we'd have something like IF (RELDIF(ABS((X/B)/B) <= $tolerance) GOTO DONE
. And there'd definitely be issues with X
and Y
values where X == Y
, but other than that I'm not sure. Some very large but close numbers (i.e. some billion, and the the next representable number) seem like a good candidate to muck around with. So take, say, a million and call something like C++ math
library's nextafter()
function on it - we'll end up with a tiny value we try to divide by a large one.
# quick aside - directly translating the sqrt() from this chapter # into Python results in code which is not very idiomatic threshold = 1e-5 def sqrt_eops_1(x): if x >= 0: if x > 0: b = 1.0 while abs((x/b)/b - 1.0) >= threshold: a = b b = ((x/a) + a) / 2.0 return b return 0 raise Exception("UNDEFINED") print("sqrt_eops_1") print(sqrt_eops_1(9.0)) print(sqrt_eops_1(50.0)) print() # slightly better def sqrt_sean(x): if x < 0: raise Exception("UNDEFINED") if x == 0: return 0 guess = 1.0 while abs((x/guess)/guess - 1.0) >= threshold: guess = ((x / guess) + guess) / 2 return guess print("sqrt_sean") print(sqrt_sean(9.0)) print(sqrt_sean(50.0)) print() # and using the "reldif" function def reldif(x, y): return abs(x-y) / float(max(abs(x), abs(y))) def sqrt_sean_reldif(x): if x == 0: return 0 if x < 0: raise Exception("UNDEFINED") guess = 1.0 while reldif(x, guess * guess) >= threshold: guess = ((x / guess) + guess) / 2 return guess print("sqrt_sean_reldif") print(sqrt_sean_reldif(9)) print(sqrt_sean_reldif(50)) print()
sqrt_eops_1 3.000000001396984 7.071067984011346 sqrt_sean 3.000000001396984 7.071067984011346 sqrt_sean_reldif 3.000000001396984 7.071067984011346
Chapter 2 - Expression
OK much more going on here. A couple of rules come out in quick succession over the course of a handful of frustrating Fortran snippets:
Say what you mean, simply and directly
Use library functions
Avoid temporary variables
Write clearly - don't sacrifice clarity for "efficiency"
These are pretty common sense ... except "Avoid temporary variables". I disagree here. The example given is:
10 F1=X1-X2*X2
F2=1.0-X2
FX=F1*F1+F2*F2
C NOTE THAT IT IS MORE EFFICIENT TO COMPUTE
C F!*F1 THAN TO COMPUTE F1**2
It's suggested that some Fortran compiler might generate faster code with the following:
10 FX = (X1 - X2**2)**2 + (1.0 - X2)**2
... with the following:
This rendition also happens to be more readable and eliminates the temporary variables
F1
andF2
, which have little mnemonic value. The fewer temporary variables in a program, the less chance there is that one will not be properly initialized.
I disagree here - the problem with this example is not the temporary variables, but with what they are named. If they were more appropriately named than F1
and F2
it'd be easier to understand what it actually does and spot errors more easily.
Anyway, more rules:
Let the machine to the dirty work
The example given for this is a weird one, some comment referring to a binary literal included for ... performance reasons?
One of the first services to be automated in early computer languages was the conversion of decimal to binary. It would be a cshame if we were forced to think in binary, after all these years, by misinformed considerations of "efficiency" (Most compilers will convert "47" to binary at compile time, by the way. Those that will not must certainly provider worse inefficiencies to worry about.)
It's really weird to think that there might have been some point where runtime conversion of an int literal to binary was an issue. I am curious if this was a real problem.
We continue with bread-and-butter bad code refactoring:
Replace repetitive expressions by calls to a common function
Parenthesize to avoid ambiguity
I was once asked in a job interview to evaluate in the result of a bunch of expressions (like x*y/10-z%100
) - so they could have saved themselves a bunch of time by asking me "What is the order of precedence for the maths operators in C?"
Choose variable names that won't be confused
... so, like F1
and F2
Here's a fun one:
Avoid the Fortran arithmetic IF
So in addition to a logical IF (IF somethingTrueOrFalse THEN doSomething
) Fortran apparently has an IF statement that operates on an integer and has three components:
- code that's executed if the expression is < 0
- code that's executed if the expression is == 0
- code that's executed if the expression is > 0
Avoid unnecessary branches
Good advice, interesting example. They suggest that instead of looping over all elements of an array A
and doing IF (A[I] .GT. GRVAL) GRVAL = A[I]
, it's better use GRVAL = AMAX1(GRVAL, A(1)
. I think this violates the Write clearly... rule, and if we really care about it in this case I'm sure there's a function like GRVAL = MAX(A)
. Basically I think there's a better example to be made here.
Use the good features of a language; avoid the bad ones
There's actually no good example given here, it's right after a snippet of poorly indented sorting function that abuses GOTO to early exit. Some hate for a (poorly named) temp variable which is IMO misplaced. But it's not clear what the bad/good features are being highlighted (GO TO = bad, of course even back in 1988 I think that was a given).
Don't use conditional branches as a substitute for a logical expression
Yep - I don't think anyone deliberately does this:
if (some_expression):
return True
return False
But I've seen it where a complex method gets reduced, refactored and optimised piecemeal - then left as-is.
Use the "telephone test" for readability
Quite a nice way to think about how understandable code is - if you can read it out loud to someone who can't see it and it makes sense to them then that's good. But code poorly named variables (if m > n and nn > mm ...
) or long complex expressions will rightly fail the telephone test, indicating that it needs to be rewritten slightly.
Chapter 3: Control Structure
It's easy to forget how good we have it these days - quite a lot of the examples we've seen so far involve GOTO abuse and branching in Fortran. I didn't realise why this was. Apparently in older Fortran there's no simple way to have an IF
guard a block of code without jumping round it using a GOTO. Which is why there's things like:
IF (TABLE (NO) .GT. HICOM) GOTO 50
GO TO 20
50 HICOM = TABLE (NO)
NUMBER = NO
20 CONTINUE
Where we'd preferably structure it more like:
IF (TABLE (NO) .GT. HICOM) THEN DO
HICOM = TABLE (NO)
NUMBER = NO
END
Which explains why the next rule is:
Use
DO-END
and indenting to delimit groups of statements
This is second nature to us nowadays. The next suggestion still makes sense however.
IF HRS_WORKED<=40
THEN CALL REGPAY
IF HRS_WORKED>=40
THEN CALL OTPAY
Because we've manually entered the conditions we've accidentally introduced a bug:
IF HRS_WORKED<=40
THEN CALL REGPAY;
ELSE
THEN CALL OTPAY;
So,
Use
IF-ELSE
to emphsize that only one of two actions is to be performed
Going back to the old-school coding, there's a few examples of loops constructed through IF
/GOTO
which might have been common in Fortran (definitely in Assembly programming, of course) hence the inclusion of:
Use
DO
andDO-WHILE
to emphasize the presence of loops
Next we arrive at a solid rule in a weird way:
Make your programs read from top to bottom
We have a function that has a goto-based loop which calculates bowling scores, it sort of jumps all over the place, and has a RETURN
statement near but not at the end. The resulting code is more readable, but I don't think it's because of the rule. I think it's because we've restructured using DO
, we've followed the "Say what you mean..." rule and not tried to be too clever. The top-to-bottom rule for me is more appropriate if we have, say, a class that represents a REST endpoint. It might contain a number of functions, but you'd expect it to have structure something like:
class MyApiController(ApiController):
def post(self, payload):
foo = self._foo_a_thing(payload.foo_thing)
bar = self._bar_a_thing(payload.bar_thing)
# ...
def _foo_a_thing(self, thing):
# return the foo'd thing
def _bar_a_thing(self, thing):
# return the bar'd thing
You start reading the class definition, immediately see the main method and can fill in the blanks later. You don't need to hunt around for it, and to me at least it makes sense to get a general idea of what a method does before investigating everything it might call.
Next up we have the following, which maybe could've been covered by the original IF-ELSE
if it had been a bit more general:
Use
IF ... ELSE IF ... ELSE IF ... ELSE
to implement multi-way branches
A couple of control flow constructs are defined as "fundamental" and map to the elements described by Dijkstra in "Structured Programming":
- statement grouping with, for example
DO-END
orBEGIN-END
- decision making with
IF-ELSE
- looping with
DO
andDO-WHILE
- subroutines, functions or procedures
Use the fundamental control flow constructs
So ... yeah, use these? And a couple of pages later:
Write first in an easy-to-understand pseudo-language then translate into whatever language you have to use
... is a pretty common one - whether it's actual writing something out in pseudocode or setting it out in actual code but spelling out the steps with stub functions and comments.
There's some good ones though:
Avoid
THEN-IF
and nullELSE
AvoidELSE GOTO
andELSE RETURN
There's a snippet which computes the "effective weight" of an aeroplane, which is reworked to avoid these anti-patterns. I don't want to type out the Fortran, so here it is in Python in the form of a function bad_effective_weight()
:
def bad_effective_weight(weight, length, wingspan): if length >= 30 and length <= 50: if wingspan < 0.6 * length: return (1 + 0.08 - 0.037) * weight else: return (1 + 0.08 + 0.045) * weight elif length > 50 and length < 60: if wingspan < 0.6 * length: return (1 + 0.09 - 0.037) * weight else: return (1 + 0.09 + 0.045) * weight elif length > 60 and length < 80: if wingspan < 0.6 * length: return (1 + 0.105 - 0.037) * weight else: return (1 + 0.105 + 0.045) * weight else: if wingspan < 0.6 * length: return (1 + 0.122 - 0.037) * weight else: return (1 + 0.122 + 0.045) * weight
So it's kind of a mess of if/elif branches, repeated code and constants that don't make much sense.
- we switch on
wing < 0.6 * length
, and adjust the between the multiplier by +0.037 and -0.045 accordingly - we adjust the multiplier by either 0.08, 0.09, 0.105 or 0.122 according to the
length
- the case where
length == 60
accidentally falls through to same code that handleslength > 80
- the case where
length < 30
falls through to this same code
I'd do this differently - see the improved_effective_weight
below:
def improved_effective_weight(weight, length, wingspan): adjustment_factor = 1 + wing_length_adjustment(wingspan, length) + absolute_length_adjustment(length) return weight * adjustment_factor def wing_length_adjustment(wingspan, length): if wingspan < 0.6 * length: return -0.037 return 0.045 def absolute_length_adjustment(length): if length > 80: return 0.122 elif length > 60: return 0.105 elif length > 50: return 0.09 elif length > 30: return 0.08 return 0
Here, I've:
- pulled out the selection of the adjustments into separate, clearly named functions
- made the assumption that
length <= 30
has no adjustment - handled the errant
length == 60
by consistently comparing using>
and not a mixture of>
/>=
/<
/<=
- which is just asking for an annoying off-by-one error
This lets us easily see that the effective weight is computed by multiplying by the weight by a factor that adjusts depending on the length of the plane and its length (this is even more obfuscated in the original Fortran). It is important to note that the assumptions I have made are likely wrong. However now we have easy-to-understand code that can be reviewed by the aerospace engineer involved (who likely cannot code). Additionally the selection of the absolute length adjustment seems very course-grained, it's very likely that this could be replaced with a calculation instead that's more accurate.
Next one was familiar and not-quite what I thought:
Follow each decision as closely as possible with its associated action
This reminded me of the principle of Code Complete where McConnell would measure a variable's declaration (or was it initialization?) with it's use - the idea being the closer the distance the more readbale the code is and the less chance the variable would be misused. However it was actually was still referring to IFs and GOTOs. This is getting really tiresome, if I'm honest and I'm not enjoying the book as I much as I thought I would. Especially since we keep seeing snippets of Fortran which have a defect. As an example, let's take a look at the change calculating program (translated into Python from Fortran, of course):
def calc_change_bad(diff): a = [20.0, 10.0, 5.0, 1.0, 0.25, 0.1, 0.05, 0.01] for i in range(8): nt = 0 for j in range(1, 51): if (diff / (j * a[i])) < 1: break nt = j if (nt > 0): diff = diff - (nt * a[i]) if i == 0: print(nt, "twenty dollar bills") if i == 1: print(nt, "ten dollar bills") if i == 2: print(nt, "five dollar bills") if i == 3: print(nt, "one dollar bills") if i == 4: print(nt, "quarters") if i == 5: print(nt, "dimes") if i == 6: print(nt, "nickels") if i == 7: print(nt, "pennies") calc_change_bad(22.50)
1 twenty dollar bills 2 one dollar bills 2 quarters
Of course this is horrible:
- there are repeated
if
statements - it's hard to read - the names are odd and the
break
sucks (I added the break in lieu of aGOTO
in the original) - there's a bug for values of
diff
over $1820.50 - it's not easy to add a new denomination, so if we need to handle a $50 or we need to shuffle all the
i
values
def improved_calc_change(value): cash_values = [20.0, 10.0, 5.0, 1.0, 0.25, 0.1, 0.05, 0.01] cash_names = [ "twenty dollar bills" , "ten dollar bills", "five dollar bills", "one dollar bills", "quarters", "dimes", "nickels", "pennies" ] for i, v in enumerate(cash_values): qty = int(value / v) if qty > 1: print(qty, cash_names[i]) value -= qty * cash_values[i] improved_calc_change(22.50)
2 ten dollar bills 2 one dollar bills 2 quarters
This isn't even particularly good - it's just a quick solution written in a way that could be easily implemented in Fortran or PL/I. Anyway this is a demonstratration of the next couple of principles:
Use data arrays to avoid repetitive control sequences
Choose a data representation that makes the program simple
But there's a nice example and a nice lesson - some code (to convert year/day-of-year to month/day-of-month) is incrementally reworked to be more readable and more maintainable with the following conclusion:
Don't stop with your first draft.
Good message to end the chapter on.
Chapter 4. Program Structure
The 1980s must have been wild - the code I've read so far has been kinda painful, and the first guidelines we see in this chapter would be considered quite elementary nowadays:
Modularize. Use subroutines.
Make the coupling between modules visible.
The second in particular is quite interesting - the surrounding text suggests against relying on global state (so if it was written a few years later I think it might talk about pure functions).
Each module should do one thing well
Single Responsibility!
Make sure every module hides something
... ok so I guess "don't make useless arbitrary classes"
Let the data structure the program
This is very similar to some stuff I'd read recently - specifically I liked an example in one of Yaron Minsky's articles about using types in Ocaml to model socket connections.
Don't patch bad code - rewrite it
Depending on the change, sure. I don't think anyone would argue this is universal.
Write and test a big program in small pieces
This is really cool, well before the big push for Unit Testing there's a section on writing testable code!
Use recursive procedures for recursively-defined data structures
I think they're saying that recursion makes sense to model the solution some problems, but not necessarily to implement them.
Chapter 5. Input and Output
We open with an anecdote about how someone accidentally entering an erroneous "P" character when inputting the valuation of a car, so it was incorrectly interpreted as being worth USD 7,000,950 instead of USD 950.
Test input for validity and plausability
Make sure input cannot violate the limits of the program
Terminate input by end-of-file or marker, not by count
Identify bad input; recover if possible
Treat end of file conditions in a uniform manner
Make input easy to prepare and output self-explanatory
Use uniform input formats
Make input easy to proofread
Use free-form input when possible
Use self-identifying input. Allow defaults. Echo both on output.
Localize input and output in subroutines
Chapter 6. Common Blunders
Make sure all variables are initialized before use.
Don't stop at one bug
Yeah well particularly when you're looking at the code snippets in this book :-(
Use debugging compilers
Or "linters"
Initialize constants with
DATA
statements orINITIAL
attributes; initialize variables with executable code
... what?
Watch out for off-by-one errors
Maybe slightly less of an issue nowadays - I imagine many problems came from for (i=0; i<10; i++) { do_stuff(items[i]); }
and basically everyone has an equivalent of for item in items: do_stuff(item)
. Still a thing though.
Take care to branch the right way on equality
Avoid multiple exits from loops.
Make sure your code "does nothing" gracefully
Test programs at their boundary values
Program defensively
10.0 times 0.1 is hardly ever 1.0
Don't compare floating point numbers just for equality
Chapter 7. Efficiency and Instrumentation
OK this one worked out quite well. This is actually a much better than the usual rules of optimisation:
- Don't.
- Don't Yet (for experts only).
Which is funny but not instructive or helpful. Let's get into it...
Keep it right when you make it faster
Make it clear before you make it faster
These go hand-in-hand as exercises to be performed before any attempt at optimisation is performed. It'll make you understand the nuances of the code better (so you won't accidentally "optimise" away something important) and it'll help to identify any potential low hanging fruit - easy and relatively safe performance gains.
Don't sacrifice clarity for small gains in "efficiency"
It's rare for a tiny performance boost to be worth the cost of maintaining and working with code that has a weird or complex optimisation.
Let your compiler do the simple optimisations
I did not expect to see this in a book from 1974! There's mention of optimising compilers that can hoist operations out of a loop. But the general idea idea is.
Don't strain to re-use code; reorganize instead
OK so the example has a really unusual attempt to re-use some PUT...
statement, and happens to be utterly inscrutable. This one feels a little forced, or doesn't belong in this section (maybe the Program Structure chapter instead)
Make sure special cases are truly special
OK my take on this is "Don't just add if...
branches to make your program behave correctly"
Keep it simple to make it faster
Don't diddle code to make it faster - find a better algorithm
So, not always possible but generally good to keep in mind.
Instrument your programs. Measure before making "efficiency changes.
Yep - you don't know if you've made anything better if you can't measure it.
Chapter 8. Documentation
Again this section is a nice departure with pretty sound, timeless advice. More weird Fortran however.
Don't just echo the code with comments - make every comment count
Takes me back a bit to my days learning C at school:
// add ten to i
i += 10;
Don't comment bad code - rewrite it
Once more, not exactly true in every case - but if you find yourself writing /* This is hard to follow, but what's happening is ... */
it's generally a bad sign
Use variable names that mean something
(Painfully absent from most of the snippets)
Use statement labels that mean something
(Ok we don't usually use labels/gotos anymore)
Format a program to help the reader understand it
Indent to show the logical structure of a program
Document your data layouts
Don't over-comment
(kinda repeating the "Don't just echo the code ..." one, but sure)
Conclusion
I really chose this poorly as the first book I was gonna keep notes on - I didn't find a lot of things I really needed to think about. Normally I like it when a programming book has lots of code snippets to pick through and understand. But the PL/I and Fortran was truly painful, I wasn't learning anything by reading them and found myself deliberately not packing my Kindle in the morning so I'd have to find something else to read on the Tram to and from work.
Chapters 7 and 8 still hold up today, however. I wonder what a modern version of this book would look like, what languages it'd use and who would write it. Maybe similar to Code Complete?
I really did not know the level of notes to keep, and it felt kinda stupid to have a note for every one of the programming guidelines.