What do we mean by data after all?

I’m reading Structure and Interpretation of Computer Programs (the famous SICP)… what a fantastic book! I’ve not finished it yet, but until now I can surely say it’s the best book about Computer Science I ever touched since I started my studies in this field.

This book is about abstraction… about how to control the complexity of expressing ideas. And this is, in fact, the real deal of Computer Science: How to formalize ideas of process (the “How To” knowledge) in a manageable way. All these “modern” concepts of first order functions, closures, dynamic typing, message passing (a.k.a. object orientation) are nothing more than instances of a more fundamental one: abstraction.

It’s interesting to see that these “modern” things are not new at all, being well-known to many academics already in the 1970s. If it’s true that a lot of academics sometimes lose the sight of reality, it’s also true that most CS professionals are alienated to the technology, when what really matters for their work is the thinking. If this wasn’t the case, maybe the old things wouldn’t be new today =P. This book opens the mind to this and I’d like to have placed my hands on it a long time ago. Perhaps, I just wasn’t prepared =).

The SICP uses a dialect of Lisp called Scheme to introduce its subject. I’ve always heard wonders about this language from very smart people (e.g. Peter Norvig, Paul Graham, Richard Stallman), and I’ve tried to learn it a few times. But, I couldn’t see why all those smart people revered Scheme. It seemed to me just like another functional language, with an exotic notation and a lack of practical purpose. This bothered me because I really think that if a lot of smart people like something that I don’t, there’s something wrong with me =P. The problem was an extraordinary idea needs an extraordinary presentation to be fully understood and afford that “aha! moment“. I wasn’t exposed to such presentation until SICP.

One of the greatest insights SICP gives is that all in the ideal world of software is process (or procedure, which is the expression of a process), even data. And this vision of data is so confusing and surprising I decided to write a post about it. Let’s see what it means using a SICP example.

Suppose we’d like to develop a rational number library. First, I need to understand what a rational number is: two integers representing the numerator and the denominator. So, in order to operate with rational numbers we’ll need a constructor and selectors for the numerator and the denominator. In Scheme it could be something like:

(define (make-rat n d) ...)
(define (numer x) ...)
(define (denom x) ...)

From these basics we can implement procedures for a rational number arithmetic:

(define (add-rat x y)
  (make-rat (+ (* (numer x) (denom y))
               (* (numer y) (denom x)))
            (* (denom x) (denom y))))

(define (mul-rat x y)
  (make-rat (* (numer x) (numer y))
            (* (denom x) (denom y))))

...

The arithmetic procedures see a rational number by means of the constructor and the selectors, which are themselves procedures. Thus, until now, despite we think about rational numbers as data, we are working with procedures. But you can say: “you are tricking me, after all the selectors are working over CONCRETE data”. It seems reasonable… let’s implement make-ratnumer, denom and investigate this. One possibility is to represent a rational number through pairs using the cons, car and cdr Scheme primitive procedures.

(define (make-rat n d)
  (cons n d))

(define (numer x)
  (car x))

(define (denom x)
  (cdr x))

The cons procedure receives two things (doesn’t matter what they really are) and returns a pair. car/cdr extracts the first/last element of a pair. Seeing this you say: “Aha! I was right, there’s a CONCRETE data called pair. Thus, there’s a difference between procedures and data. Constructors and selectors are just a simple abstraction on top of the real concrete data.” But this is not true because cons, car and cdr are just procedures like make-rat, numer and denom. Nothing says pairs are CONCRETE data. Consider the following implementation of those procedures:

(define (cons n d)
  (define (dispatch m)
    (cond ((= m 0) n)
          ((= m 1) d)))
  dispatch)

(define (car z)
  (z 0))

(define (cdr z)
  (z 1))

Pay attention to them until you understand. Can you see that “CONCRETE data called pair” is just another procedure? Just a process? It does not exist concretely, it’s an abstraction of the idea of a pair. You may argument that numbers are concrete data… “1, 2, 3… this kind of basic stuff is primitive, there’s no way it can be a procedure” you might say. Indeed, it can. For the natural numbers they’re called Church Numerals because were defined by Alonzo Church in his work on λ-calculus.

This is a little mind-boggling at first since we are used to think about procedures and data as separate complementary things, but they aren’t. If you see a program as a representation of thought and you recognize that data is just a thought, this idea becomes clear. Although this seems like just a curiosity, this notion is extremely important because at the end it’s just abstraction and Computer Science is all about this.

Of course, if data is procedure, can any procedure be seem as data? No… what distinguishes processes that are data from those that aren’t are the restrictions (or axioms) over the processes. For instance, if make-ratnumerdenom defines a rational number it must be true that:

(= (/ (numer (make-rat n d))
      (denom (make-rat n d)))
   (/ n d))

The same way, for the conscarcdr we must have:

(eq? (car (cons a b)) a)
(eq? (cdr (cons a b)) b)

To summarize we could define data as: procedures + axioms over them.

PS: An important thing here is the use of Scheme to demonstrate these concepts. Why not use C or Python or any other mainstream language? After all, the behavior of the code presented could be emulated in other languages than Scheme. The point is that Scheme was designed to make these concepts transparent. When you write a Scheme program, you think this way. In fact, this way of thinking is based on a solid elegant theory called λ-calculus (the basis of Church Numerals) and Scheme was conceived to model it. This is the reason Scheme seems so elegant to me. All just fit right. It’s the most orthogonal language I know and this does not make the programmer unproductive, quite the contrary.

Posted on August 5, 2011, in Functional programming, Scheme and tagged , . Bookmark the permalink. Leave a comment.

Leave a comment