Sean McLemon

2020-02-01 - The Elements of Programming Style (1974)

(original .ipynb)

The Elements of Programming Style was written by Brian Kernighan and P.J. Plauger and I've got a lot of respect for both of them. In addition to writing some of the most important software, Kernighan also wrote some books I really enjoyed - The C Programming Language, The AWK Programming Language and The UNIX Programming Environment. Plauger's company "Dinkumware Inc" implemented a C standard library back in the '80s which I worked with a bit back when I was in Analog Devices. I hadn't really heard anything about the book but I'm curious what was considered good practice back in 1974, and how relevant it is for developers in 2019.

Chapter 1 - Introduction

In typical Kernighan fashion, we open with a snippet of code:

   DO 14 I=1,N
   DO 14 J=1,N
14 V(I,J)=(I/J)*(J/I)

It seems I'll have to read a fair amount of Fortran in this book :-) Anyway this is basically performing something like the below:

for i in range(n):
    for j in range(n):
        v[i][j] = (i/j) * (j/i)

Which is an obscure way to generate an identity matrix, and one which is a little bit wasteful (two divisions and one multiplication per element). He gives a "better" approach in Fortran (which I'll ignore as it is inscrutable to my non-Fortran-understanding eyes) and in PL/I, which is below:

/* MAKE V AN IDENTITY MATRIX */
     V = 0.0; 
     DO I = 1 TO N;
        V(I,I) = 1.0;
     END

So the first lesson is pretty clear, and is spelled out explicitly:

Write clearly - don't be too clever.

There's a bit more introducing the various chapters of the book, a couple of examples (another Fortran one - refactoring some code that did a logical test which just skipped over a GOTO instruction) and a quick explanation that it doesn't matter what language you're using, "although details vary from language to language, the principles of style are the same" - stuff like branching around branches is gonna be confusing or weird in any language.

There are also a dodgy sort and a couple of questions under the title "Points To Ponder" so it seems like a good way to finish each chapter would be to work through these (update: no it is absolutely not, I will not be doing this):

Points To Ponder

1.1 A matrix with n rows/columns has n^2 elements, initializing it takes n^2 assignments. Multiplying matrices or solving n linear equations in n unknowns involves order n^3 operations. Give arguments to support both: * If n >= 10 the time required to initialize a matrix is not very important * The time taken to initialize the matrix will pale in comparison to the time taken to perform the computation * If n < 10 the time required to initialize a matrix is not very important * The initialization time will be a proportionally larger, but given the small problem size it's not particularly large overall

1.2 Ok the question is basically which version of the two functions below do I prefer:

C  Original version
100 READ(5, 110) X
110    FORMAT(F10.0)
    IF (X .LT. 0.0) WRITE(6, 120) X
120    FORMAT(1X, 'SQRT(', qPE12.4, ') UNDEFINED')
    IF (X .EQ. 0.0) WRITE(6, 130) X, X
130    FORMAT(1X, 'SQRT(', 1PE12.4, ') = ', 1PE12.4)
    IF (X .LE. 0.0) GOTO 100
    B = X/2.0
200 IF (ABS(X/B - B) .LT. 1.0E-5 * B) GOTO 300
    B = (X/B + B) / 2.0
    GOTO 200
300 WRITE(6,130) X,B
    GOTO 100
    END

C  Fixed version
10 READ(5, 11) X
11 FORMAT (F10.0)
   IF (X .GE. 0.0) GOTO 20
      WRITE(6,13) X
13    FORMAT (' SQRT(', 1PE12.5, ') UNDEFINED')
      GOTO 10
20 IF (X .GT. 0.0) GOTO 30
      B = 0.0
      GOTO 50
30 B = 1.0
40    A = B
      B = (X/A + A)/2.0
      IF (ABS((X/B)/B - 1.0) .GE. 1.0E-5) GOTO 40
50 WRITE(6, 51) X, B
51 FORMAT(' SQRT(', 1PE12.5, ') = ', 1PE12.5)
   GOTO 10
   END

OK so I really don't like reading Fortran but the line labels in #2 are a bit more sensible, in #1 there's three different comparisons between X and 0 (first <, then ==, then <=) which make it harder to read than #1

1.3 In a square-root routine earlier there's some performed to decide when the newton-raphson approximation is "good enough" - so when computing y = sqrt(x) it checks if (y*y) - x is less than the value 1.0e-5. A slightly different method is suggested that tests against a threshold relative to the size of input:

REAL FUNCTION RELDIF(X,Y)
RELDIF = ABS(X-Y) / AMAX1(ABS(X),ABS(Y))
RETURN
END

The question is - how can it be used in our example implementation and are there any values that could cause problems if this RELDIF is used?

Instead of the IF (ABS((X/B)/B - 1.0) .GE. 1.0E-5) GOTO DONE we'd have something like IF (RELDIF(ABS((X/B)/B) <= $tolerance) GOTO DONE. And there'd definitely be issues with X and Y values where X == Y, but other than that I'm not sure. Some very large but close numbers (i.e. some billion, and the the next representable number) seem like a good candidate to muck around with. So take, say, a million and call something like C++ math library's nextafter() function on it - we'll end up with a tiny value we try to divide by a large one.

# quick aside - directly translating the sqrt() from this chapter 
# into Python results in code which is not very idiomatic

threshold = 1e-5

def sqrt_eops_1(x):
    if x >= 0:
        if x > 0:
            b = 1.0
            while abs((x/b)/b - 1.0) >= threshold:
                a = b
                b = ((x/a) + a) / 2.0
            return b
        return 0
    raise Exception("UNDEFINED")
    
print("sqrt_eops_1")
print(sqrt_eops_1(9.0))
print(sqrt_eops_1(50.0))
print()

# slightly better
def sqrt_sean(x):
    if x < 0:
        raise Exception("UNDEFINED")
        
    if x == 0:
        return 0
        
    guess = 1.0
    while abs((x/guess)/guess - 1.0) >= threshold:
        guess = ((x / guess) + guess) / 2
        
    return guess

print("sqrt_sean")
print(sqrt_sean(9.0))
print(sqrt_sean(50.0))
print()

# and using the "reldif" function
def reldif(x, y):
    return abs(x-y) / float(max(abs(x), abs(y)))

def sqrt_sean_reldif(x):
    if x == 0:
        return 0
    
    if x < 0:
        raise Exception("UNDEFINED")
        
    guess = 1.0
    while reldif(x, guess * guess) >= threshold:
        guess = ((x / guess) + guess) / 2
        
    return guess

print("sqrt_sean_reldif")
print(sqrt_sean_reldif(9))
print(sqrt_sean_reldif(50))
print()

sqrt_eops_1
3.000000001396984
7.071067984011346

sqrt_sean
3.000000001396984
7.071067984011346

sqrt_sean_reldif
3.000000001396984
7.071067984011346

Chapter 2 - Expression

OK much more going on here. A couple of rules come out in quick succession over the course of a handful of frustrating Fortran snippets:

Say what you mean, simply and directly

Use library functions

Avoid temporary variables

Write clearly - don't sacrifice clarity for "efficiency"

These are pretty common sense ... except "Avoid temporary variables". I disagree here. The example given is:

    10 F1=X1-X2*X2
       F2=1.0-X2
       FX=F1*F1+F2*F2
C NOTE THAT IT IS MORE EFFICIENT TO COMPUTE
C F!*F1 THAN TO COMPUTE F1**2

It's suggested that some Fortran compiler might generate faster code with the following:

    10 FX = (X1 - X2**2)**2 + (1.0 - X2)**2

... with the following:

This rendition also happens to be more readable and eliminates the temporary variables F1 and F2, which have little mnemonic value. The fewer temporary variables in a program, the less chance there is that one will not be properly initialized.

I disagree here - the problem with this example is not the temporary variables, but with what they are named. If they were more appropriately named than F1 and F2 it'd be easier to understand what it actually does and spot errors more easily.

Anyway, more rules:

Let the machine to the dirty work

The example given for this is a weird one, some comment referring to a binary literal included for ... performance reasons?

One of the first services to be automated in early computer languages was the conversion of decimal to binary. It would be a cshame if we were forced to think in binary, after all these years, by misinformed considerations of "efficiency" (Most compilers will convert "47" to binary at compile time, by the way. Those that will not must certainly provider worse inefficiencies to worry about.)

It's really weird to think that there might have been some point where runtime conversion of an int literal to binary was an issue. I am curious if this was a real problem.

We continue with bread-and-butter bad code refactoring:

Replace repetitive expressions by calls to a common function

Parenthesize to avoid ambiguity

I was once asked in a job interview to evaluate in the result of a bunch of expressions (like x*y/10-z%100) - so they could have saved themselves a bunch of time by asking me "What is the order of precedence for the maths operators in C?"

Choose variable names that won't be confused

... so, like F1 and F2

Here's a fun one:

Avoid the Fortran arithmetic IF

So in addition to a logical IF (IF somethingTrueOrFalse THEN doSomething) Fortran apparently has an IF statement that operates on an integer and has three components:

code that's executed if the expression is < 0
code that's executed if the expression is == 0
code that's executed if the expression is > 0

Avoid unnecessary branches

Good advice, interesting example. They suggest that instead of looping over all elements of an array A and doing IF (A[I] .GT. GRVAL) GRVAL = A[I], it's better use GRVAL = AMAX1(GRVAL, A(1). I think this violates the Write clearly... rule, and if we really care about it in this case I'm sure there's a function like GRVAL = MAX(A). Basically I think there's a better example to be made here.

Use the good features of a language; avoid the bad ones

There's actually no good example given here, it's right after a snippet of poorly indented sorting function that abuses GOTO to early exit. Some hate for a (poorly named) temp variable which is IMO misplaced. But it's not clear what the bad/good features are being highlighted (GO TO = bad, of course even back in 1988 I think that was a given).

Don't use conditional branches as a substitute for a logical expression

Yep - I don't think anyone deliberately does this:

if (some_expression):
    return True
return False

But I've seen it where a complex method gets reduced, refactored and optimised piecemeal - then left as-is.

Use the "telephone test" for readability

Quite a nice way to think about how understandable code is - if you can read it out loud to someone who can't see it and it makes sense to them then that's good. But code poorly named variables (if m > n and nn > mm ...) or long complex expressions will rightly fail the telephone test, indicating that it needs to be rewritten slightly.

Chapter 3: Control Structure

It's easy to forget how good we have it these days - quite a lot of the examples we've seen so far involve GOTO abuse and branching in Fortran. I didn't realise why this was. Apparently in older Fortran there's no simple way to have an IF guard a block of code without jumping round it using a GOTO. Which is why there's things like:

    IF (TABLE (NO) .GT. HICOM) GOTO 50
    GO TO 20
50  HICOM = TABLE (NO)
    NUMBER = NO
20  CONTINUE

Where we'd preferably structure it more like:

    IF (TABLE (NO) .GT. HICOM) THEN DO
        HICOM = TABLE (NO)
        NUMBER = NO
    END

Which explains why the next rule is:

Use DO-END and indenting to delimit groups of statements

This is second nature to us nowadays. The next suggestion still makes sense however.

IF HRS_WORKED<=40
    THEN CALL REGPAY
IF HRS_WORKED>=40
    THEN CALL OTPAY

Because we've manually entered the conditions we've accidentally introduced a bug:

IF HRS_WORKED<=40
    THEN CALL REGPAY;
ELSE
    THEN CALL OTPAY;

So,

Use IF-ELSE to emphsize that only one of two actions is to be performed

Going back to the old-school coding, there's a few examples of loops constructed through IF/GOTO which might have been common in Fortran (definitely in Assembly programming, of course) hence the inclusion of:

Use DO and DO-WHILE to emphasize the presence of loops

Next we arrive at a solid rule in a weird way:

Make your programs read from top to bottom

We have a function that has a goto-based loop which calculates bowling scores, it sort of jumps all over the place, and has a RETURN statement near but not at the end. The resulting code is more readable, but I don't think it's because of the rule. I think it's because we've restructured using DO, we've followed the "Say what you mean..." rule and not tried to be too clever. The top-to-bottom rule for me is more appropriate if we have, say, a class that represents a REST endpoint. It might contain a number of functions, but you'd expect it to have structure something like:

class MyApiController(ApiController):
    def post(self, payload):
        foo = self._foo_a_thing(payload.foo_thing)
        bar = self._bar_a_thing(payload.bar_thing)
        # ...

    def _foo_a_thing(self, thing):
        # return the foo'd thing

    def _bar_a_thing(self, thing):
        # return the bar'd thing

You start reading the class definition, immediately see the main method and can fill in the blanks later. You don't need to hunt around for it, and to me at least it makes sense to get a general idea of what a method does before investigating everything it might call.

Next up we have the following, which maybe could've been covered by the original IF-ELSE if it had been a bit more general:

Use IF ... ELSE IF ... ELSE IF ... ELSE to implement multi-way branches

A couple of control flow constructs are defined as "fundamental" and map to the elements described by Dijkstra in "Structured Programming":

statement grouping with, for example DO-END or BEGIN-END
decision making with IF-ELSE
looping with DO and DO-WHILE
subroutines, functions or procedures

Use the fundamental control flow constructs

So ... yeah, use these? And a couple of pages later:

Write first in an easy-to-understand pseudo-language then translate into whatever language you have to use

... is a pretty common one - whether it's actual writing something out in pseudocode or setting it out in actual code but spelling out the steps with stub functions and comments.

There's some good ones though:

Avoid THEN-IF and null ELSE Avoid ELSE GOTO and ELSE RETURN

There's a snippet which computes the "effective weight" of an aeroplane, which is reworked to avoid these anti-patterns. I don't want to type out the Fortran, so here it is in Python in the form of a function bad_effective_weight():

def bad_effective_weight(weight, length, wingspan):
    if length >= 30 and length <= 50:
        if wingspan < 0.6 * length:
            return (1 + 0.08 - 0.037) * weight
        else:
            return (1 + 0.08 + 0.045) * weight
    elif length > 50 and length < 60:
        if wingspan < 0.6 * length:
            return (1 + 0.09 - 0.037) * weight
        else:
            return (1 + 0.09 + 0.045) * weight
    elif length > 60 and length < 80:
        if wingspan < 0.6 * length:
            return (1 + 0.105 - 0.037) * weight
        else:
            return (1 + 0.105 + 0.045) * weight
    else:
        if wingspan < 0.6 * length:
            return (1 + 0.122 - 0.037) * weight
        else:
            return (1 + 0.122 + 0.045) * weight

So it's kind of a mess of if/elif branches, repeated code and constants that don't make much sense.

we switch on wing < 0.6 * length, and adjust the between the multiplier by +0.037 and -0.045 accordingly
we adjust the multiplier by either 0.08, 0.09, 0.105 or 0.122 according to the length
the case where length == 60 accidentally falls through to same code that handles length > 80
the case where length < 30 falls through to this same code

I'd do this differently - see the improved_effective_weight below:

def improved_effective_weight(weight, length, wingspan):
    adjustment_factor = 1 + wing_length_adjustment(wingspan, length) + absolute_length_adjustment(length)
    return weight * adjustment_factor

def wing_length_adjustment(wingspan, length):
    if wingspan < 0.6 * length:
        return -0.037
    return  0.045

def absolute_length_adjustment(length):
    if length > 80:
        return 0.122
    elif length > 60:
        return 0.105
    elif length > 50:
        return 0.09
    elif length > 30:
        return 0.08
    return 0

Here, I've:

pulled out the selection of the adjustments into separate, clearly named functions
made the assumption that length <= 30 has no adjustment
handled the errant length == 60 by consistently comparing using > and not a mixture of >/>=/</<= - which is just asking for an annoying off-by-one error

This lets us easily see that the effective weight is computed by multiplying by the weight by a factor that adjusts depending on the length of the plane and its length (this is even more obfuscated in the original Fortran). It is important to note that the assumptions I have made are likely wrong. However now we have easy-to-understand code that can be reviewed by the aerospace engineer involved (who likely cannot code). Additionally the selection of the absolute length adjustment seems very course-grained, it's very likely that this could be replaced with a calculation instead that's more accurate.

Next one was familiar and not-quite what I thought:

Follow each decision as closely as possible with its associated action

This reminded me of the principle of Code Complete where McConnell would measure a variable's declaration (or was it initialization?) with it's use - the idea being the closer the distance the more readbale the code is and the less chance the variable would be misused. However it was actually was still referring to IFs and GOTOs. This is getting really tiresome, if I'm honest and I'm not enjoying the book as I much as I thought I would. Especially since we keep seeing snippets of Fortran which have a defect. As an example, let's take a look at the change calculating program (translated into Python from Fortran, of course):

def calc_change_bad(diff):
    a = [20.0, 10.0, 5.0, 1.0, 0.25, 0.1, 0.05, 0.01]

    for i in range(8):
        nt = 0
        for j in range(1, 51):
            if (diff / (j * a[i])) < 1:
                break
            nt = j
        if (nt > 0):        
            diff = diff - (nt * a[i])
            if i == 0:
                print(nt, "twenty dollar bills")
            if i == 1:
                print(nt, "ten dollar bills")
            if i == 2: 
                print(nt, "five dollar bills")
            if i == 3:
                print(nt, "one dollar bills")
            if i == 4:
                print(nt, "quarters")
            if i == 5:
                print(nt, "dimes")
            if i == 6:
                print(nt, "nickels")
            if i == 7:
                print(nt, "pennies")
                
calc_change_bad(22.50)

1 twenty dollar bills
2 one dollar bills
2 quarters

Of course this is horrible:

there are repeated if statements
it's hard to read - the names are odd and the break sucks (I added the break in lieu of a GOTO in the original)
there's a bug for values of diff over $1820.50
it's not easy to add a new denomination, so if we need to handle a $50 or we need to shuffle all the i values

def improved_calc_change(value):
    cash_values = [20.0, 10.0, 5.0, 1.0, 0.25, 0.1, 0.05, 0.01]
    cash_names = [ 
        "twenty dollar bills" , "ten dollar bills", "five dollar bills", "one dollar bills",
        "quarters", "dimes", "nickels", "pennies"
    ]
    for i, v in enumerate(cash_values):
        qty = int(value / v)
        if qty > 1:
            print(qty, cash_names[i])
            value -= qty * cash_values[i]

improved_calc_change(22.50)

2 ten dollar bills
2 one dollar bills
2 quarters

This isn't even particularly good - it's just a quick solution written in a way that could be easily implemented in Fortran or PL/I. Anyway this is a demonstratration of the next couple of principles:

Use data arrays to avoid repetitive control sequences

Choose a data representation that makes the program simple

But there's a nice example and a nice lesson - some code (to convert year/day-of-year to month/day-of-month) is incrementally reworked to be more readable and more maintainable with the following conclusion:

Don't stop with your first draft.

Good message to end the chapter on.

Chapter 4. Program Structure

The 1980s must have been wild - the code I've read so far has been kinda painful, and the first guidelines we see in this chapter would be considered quite elementary nowadays:

Modularize. Use subroutines.

Make the coupling between modules visible.

The second in particular is quite interesting - the surrounding text suggests against relying on global state (so if it was written a few years later I think it might talk about pure functions).

Each module should do one thing well

Single Responsibility!

Make sure every module hides something

... ok so I guess "don't make useless arbitrary classes"

Let the data structure the program

This is very similar to some stuff I'd read recently - specifically I liked an example in one of Yaron Minsky's articles about using types in Ocaml to model socket connections.

Don't patch bad code - rewrite it

Depending on the change, sure. I don't think anyone would argue this is universal.

Write and test a big program in small pieces

This is really cool, well before the big push for Unit Testing there's a section on writing testable code!

Use recursive procedures for recursively-defined data structures

I think they're saying that recursion makes sense to model the solution some problems, but not necessarily to implement them.

Chapter 5. Input and Output

We open with an anecdote about how someone accidentally entering an erroneous "P" character when inputting the valuation of a car, so it was incorrectly interpreted as being worth USD 7,000,950 instead of USD 950.

Test input for validity and plausability

Make sure input cannot violate the limits of the program

Terminate input by end-of-file or marker, not by count

Identify bad input; recover if possible

Treat end of file conditions in a uniform manner

Make input easy to prepare and output self-explanatory

Use uniform input formats

Make input easy to proofread

Use free-form input when possible

Use self-identifying input. Allow defaults. Echo both on output.

Localize input and output in subroutines

Chapter 6. Common Blunders

Make sure all variables are initialized before use.

Don't stop at one bug

Yeah well particularly when you're looking at the code snippets in this book :-(

Use debugging compilers

Or "linters"

Initialize constants with DATA statements or INITIAL attributes; initialize variables with executable code

... what?

Watch out for off-by-one errors

Maybe slightly less of an issue nowadays - I imagine many problems came from for (i=0; i<10; i++) { do_stuff(items[i]); } and basically everyone has an equivalent of for item in items: do_stuff(item). Still a thing though.

Take care to branch the right way on equality

Avoid multiple exits from loops.

Make sure your code "does nothing" gracefully

Test programs at their boundary values

Program defensively

10.0 times 0.1 is hardly ever 1.0

Don't compare floating point numbers just for equality

Chapter 7. Efficiency and Instrumentation

OK this one worked out quite well. This is actually a much better than the usual rules of optimisation:

Don't.
Don't Yet (for experts only).

Which is funny but not instructive or helpful. Let's get into it...

Keep it right when you make it faster

Make it clear before you make it faster

These go hand-in-hand as exercises to be performed before any attempt at optimisation is performed. It'll make you understand the nuances of the code better (so you won't accidentally "optimise" away something important) and it'll help to identify any potential low hanging fruit - easy and relatively safe performance gains.

Don't sacrifice clarity for small gains in "efficiency"

It's rare for a tiny performance boost to be worth the cost of maintaining and working with code that has a weird or complex optimisation.

Let your compiler do the simple optimisations

I did not expect to see this in a book from 1974! There's mention of optimising compilers that can hoist operations out of a loop. But the general idea idea is.

Don't strain to re-use code; reorganize instead

OK so the example has a really unusual attempt to re-use some PUT... statement, and happens to be utterly inscrutable. This one feels a little forced, or doesn't belong in this section (maybe the Program Structure chapter instead)

Make sure special cases are truly special

OK my take on this is "Don't just add if... branches to make your program behave correctly"

Keep it simple to make it faster

Don't diddle code to make it faster - find a better algorithm

So, not always possible but generally good to keep in mind.

Instrument your programs. Measure before making "efficiency changes.

Yep - you don't know if you've made anything better if you can't measure it.

Chapter 8. Documentation

Again this section is a nice departure with pretty sound, timeless advice. More weird Fortran however.

Don't just echo the code with comments - make every comment count

Takes me back a bit to my days learning C at school:

// add ten to i
i += 10;

Don't comment bad code - rewrite it

Once more, not exactly true in every case - but if you find yourself writing /* This is hard to follow, but what's happening is ... */ it's generally a bad sign

Use variable names that mean something

(Painfully absent from most of the snippets)

Use statement labels that mean something

(Ok we don't usually use labels/gotos anymore)

Format a program to help the reader understand it

Indent to show the logical structure of a program

Document your data layouts

Don't over-comment

(kinda repeating the "Don't just echo the code ..." one, but sure)

Conclusion

I really chose this poorly as the first book I was gonna keep notes on - I didn't find a lot of things I really needed to think about. Normally I like it when a programming book has lots of code snippets to pick through and understand. But the PL/I and Fortran was truly painful, I wasn't learning anything by reading them and found myself deliberately not packing my Kindle in the morning so I'd have to find something else to read on the Tram to and from work.

Chapters 7 and 8 still hold up today, however. I wonder what a modern version of this book would look like, what languages it'd use and who would write it. Maybe similar to Code Complete?

I really did not know the level of notes to keep, and it felt kinda stupid to have a note for every one of the programming guidelines.

Sean McLemon | Notes

Home | Czech | Blog | GitHub | Advent Of Code | Notes