eyes black and white

Parsing Considered Harmful

So you think there is nothing interesting to say at the intersection of Computer Science and Political Economics? Well, Feynman said that everything is interesting if you go into it deeply enough. And going deeply into it is precisely what I've been doing for ten years now. So below are some things I have to say about Computer Science and Political Economics.

That is, things beyond the fact that both Computer Science and Political Economics are completely fallacious albeit traditional names: indeed, the former is not a Science (it is an Art, or an Engineering Enterprise, which is one and the same) and to quote Dijkstra it is no more about computers than astronomy is about telescopes, whereas the other is both against Politics and beyond Economics (in either the original Aristotelian meaning of husbandry, or the modern statist meaning of taxable monetary transactions). Good Computer Science is actually Lisp Lore and Craft; good Political Economics is Libertarian Human Action.

Note that if you're not too much into computing, you may skip directly to the paragraphs that mention Political Economics and Education. Yes, this is also about Education.

The Evil of Academic Curricula in Computer Science
Chapter I
Parsing Considered Harmful

The extraordinary importance given to parsing in the academic Computer Science curriculum (at least in France) is an unhealthy way to teach procrastination to students: it focuses on the mostly irrelevant but nevertheless very intricate issues of building a parse tree from an arbitrary human-writable syntax, which postpones the moment one has to face the only thing that really matters: semantics. Furthermore, defining syntax first leads to complex systems that do not have a clear meaning or a simple systematic way of analyzing and synthesizing sentences, but instead have a lot of limitations and special cases.

By contrast, Lisp does away with parsing by providing a simple extensible universal syntax; thus, developers don't have to worry about parsing a syntax, if ever, until the semantics is settled. In practice, defining a new syntax is hardly ever needed, and often detrimental: it leads to a tower of babel of limited-purpose little languages that are very costly and error-prone to interoperate if possible at all, whereas extending the Lisp syntax gives you a single universal language that does everything and with which you may build new features by freely combining previous features.
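To make the "simple extensible universal syntax" point concrete, here is a minimal sketch of an s-expression reader in Python (the helper names `tokenize` and `read` are hypothetical, and this is not any particular Lisp's actual reader): the entire parser fits in a dozen lines, leaving all the real effort for semantics.

```python
def tokenize(text):
    """Split an s-expression string into '(' , ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    """Build a nested-list parse tree from a token stream."""
    token = tokens.pop(0)
    if token == "(":
        subexprs = []
        while tokens[0] != ")":
            subexprs.append(read(tokens))
        tokens.pop(0)  # discard the closing ")"
        return subexprs
    return token  # an atom

tree = read(tokenize("(define (square x) (* x x))"))
print(tree)  # → ['define', ['square', 'x'], ['*', 'x', 'x']]
```

Because the same nested-list shape serves for every construct in the language, nothing here ever needs to change when new semantics are added on top.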

Occasionally, someone is dissatisfied with the stupid limitations of mainstream languages, and goes on to invent a new language that in his eyes would overcome these limitations. Because he doesn't know the power of Lisp syntax (or lack thereof), he invariably begins by inventing yet another syntax, usually by picking up pieces from well-known languages and sewing them together. By the time he is done defining the syntax of his language, he obtains a horrible new mumbo jumbo where misleading similarities to known systems and previously unknown oddities both contribute to making it a pain to learn; meanwhile, he has had no time to spend really thinking about the semantics of his language, which ends up with semantic limitations almost identical to those of previous languages; and the resulting language is thus no gain to use. If he knew about Lisp instead, he would just extend Lisp from within and focus on adding new functionality without having to waste all his time reimplementing features that exist or reinventing yet another useless new syntax. Eventually, he could do a major refactoring of his system, including a reimplementation of existing features, and a new syntax; but that would only come later. What he would start with right away is to face the issues that matter: semantic issues. He would only spend time on such secondary issues as syntax and implementation after he had proven that he did have a point on primary issues; and he wouldn't have to handle these secondary issues unless they were found to be significant.

The need for reimplementing new languages arises all the more frequently because people use inferior languages that limit what they can express, and because they ignore the power and extensibility of Lisp. Apart from these cases of new languages, the only case when you actually want to define a new syntax is to provide a dumbed-down interface for mostly clueless users to tinker safely without any risk concerning other parts of the system. But then you'll find that you might as well provide some interactive menu-based interface for simple user configuration; the only people who'll want to use a textual interface will be programmers anyway, and they may as well use the full power of a dynamic programming language, Lisp being the best of them. And of course, the above interactive interface can be straightforwardly extracted using proper declarative programming techniques, at a fraction of the cost of a parser, and yielding a much more usable system.

Now, once the case for new syntax (or the lack of need thereof) is resolved, you only ever have to actually write a parser when you need to interact with old syntax; this means legacy data and external software, as written by clueless people who should have known better. And then, all the fancy parsing theories taught at universities end up mostly useless, since said clueless people, being clueless, did things in clueless ad-hoc ways that do not fit any theory. What you need instead is a set of libraries that handle legacy languages and data formats that you have to interoperate with. Additionally, you may need a few simple tools to extract information from whichever of the wide collection of existing ad-hoc data formats you'll have to face. That's where some parsing theory might help; but the kind of theory rehashed in formal education will only matter for the few who will specialize in writing extraction tools -- and those few will need much more than is ever actually taught in class; they will need efficient algorithms and their actual implementation, with additional intricacies due to ambiguity resolution, error handling, incremental compilation, programming tricks and optimizations, plus many annoying secondary concerns such as interoperation and internationalization, etc.

Now for the Political Economics of Education. We have seen above that all this parsing theory, taught for months in universities and heavily used as the basis for examinations, is usually quite harmful and at best practically useless for all involved. Why then do academic curricula, in France and maybe other places, focus so much on parsing? Because they have official curricula designed by bureaucrats who advance their own agenda: they do whatever has greater benefit and lower cost to them, disregarding benefits and costs to users. And what makes parsing theory more attractive to bureaucrats? The fact that parsing theory, just like neo-classical economics or statistical sociology, makes heavy use of formal mathematical tools. These formal mathematical tools provide both a scientific veneer and a cheap way to prepare tests that are easily correctable in a uniform, objective, bureaucratic way. Of course, the relevance of given mathematical tools to any actual real-life situation is moot. Yet this relevance is what academics require students to assume implicitly without critical thought -- and any critical thought about it is systematically crushed by professors. So this is all pseudo-science, and the apparent education is actually massive brainwashing. But the bureaucrats who control education don't really care about science, only about the appearance of it. What's more, an actual failure with the appearance of success on their part means that they will be able to request a bigger budget to increase their allegedly successful contribution to solving an otherwise growing problem. So this kind of phenomenon is exactly what is selected for by bureaucracy. As Robert LeFevre put it, Government is a disease masquerading as its own cure.

Note that sometimes, instead of faking science, academics try to fake engineering or business usage. That leads to teaching some stupid industry-standard programming language or software of some sort. Students then waste years in useless courses learning less than they would in a few weeks of practice; and what they actually retain from this experience is that they have to cope in expedient ways with inexplicable arbitrary limitations of existing software that lacks any understandable structure. This very limited software in turn doesn't fit any of the equally limited theories they are taught on the side. Theory is disconnected from practice. The result of all this is a permanent mess in the students' minds, whereas the goal of education should be to give them key guiding concepts that would help keep their minds clear at all times. All in all, pseudo-engineering is just as evil as pseudo-science. But it is also an altogether different topic, and I won't say more about it today.

So, what should Computer Science faculties be teaching instead? Well, they should be teaching How To Design Programs. And this isn't the kind of things that can be dictated in a uniform way by a bureaucratic authority. As Bastiat said, All monopolies are detestable, but the worst of all is the monopoly of education.




Pseudoscience is stuff presented as science that can't be proved or duplicated. See Wikipedia (http://en.wikipedia.org/wiki/Pseudoscience) for a better definition. Parsing theory, even if you don't find it "relevant", is science in the sense that it can and has been proved mathematically.

As a side note, parsing theory is relevant in CS education because it is a prelude to teaching things like Turing machines, the halting problem and P vs. NP. In my opinion, these are important things for a university-educated computer scientist to know about.


Re: pseudoscience

Hmm. Seems to me that there are (at least) two interpretations of "parsing theory." The first, which I believe is what you're defending so staunchly, is also known as formal language theory. In this interpretation, yeah, it's a prelude to (and nearly inseparable from) the theory of computation, complexity, etc. However, there's an alternate interpretation of "parsing theory," to my eyes intended by the original posting, which is typically taught in compiler classes and has to do with the nitty-gritty details of implementing lexers and parsers for particular examples of computer languages. So, to my view, you're defending that which was not attacked.

While the material covered by the second interpretation is not a bad thing for university-educated computer scientists to know, it's certainly not a prerequisite for Turing machines, the halting problem, and P vs. NP.


Re: pseudoscience

Oh yeah, that lexers and parsers stuff is crap. :) I was really talking about the theoretical foundations of it all -- regexps, grammars and the like.

Re: pseudoscience

I'd have to say that I was never formally taught any parsing theory, but I was in the top 2 of my theory class where we spent ample time on P vs NP and computability theory. Can you enlighten me on how parsing is a prelude to this?



Agreed that overemphasizing parsing _in terms of implementing programming languages like C or Pascal_ may be silly when Lisp dialects parse (mostly) simply and are at least as powerful. However, a few things in defense of a certain amount of parsing in the curriculum:
1) If I remember my parsing correctly from school (maybe), even balanced paren expressions like Lisp require more than a context free grammar, and are a bit nontrivial to parse. (Unless you have (READ) that is.)
2) Parsers are a great way to teach state machines, and vice versa, which are useful for lots of things.
3) Even if you write all of your code in a Lisp dialect using READ, sometimes you need to parse actual DATA, and it's not all in s-expressions. Binary data, MIME messages, XML, or urls in a CGI-like application can require surprisingly serious parsing to get exactly right (which may have security implications).
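As a concrete instance of point 3, here is a hedged sketch in Python of parsing CGI-style query-string data with the standard library's vetted parser rather than ad-hoc string splitting, which typically mishandles percent-encoding and `+` (with the security implications the comment alludes to):

```python
from urllib.parse import parse_qs

# Ad-hoc splitting on "&" and "=" would miss that "+" means space,
# that "%28" means "(", and that a key may legitimately repeat.
raw = "q=foo+bar&lang=en&q=%28test%29"
params = parse_qs(raw)
print(params)  # → {'q': ['foo bar', '(test)'], 'lang': ['en']}
```

The general lesson: for data formats you did not invent, reach for a tested parsing library before writing your own.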

However, semantics is definitely at least as important, and I agree with the high-level point that CS types tend to find parsing sexier than semantics because it uses more math. (It's also easier to produce results for.)


Re: parsing

Actually, a CFG can account for balanced parens. CFGs are recognized by nondeterministic pushdown automata. These have stacks. When encountering (, push something onto the stack. When encountering ), pop it.

Remember that a CFG is a grammar with rules of the form Symb -> Symb*. A CFG for an s-expr syntax consisting of only parens and symbols looks like this:
Expr -> Atom | List
List -> ( Expr List_end
List_end -> ) | Expr List_end

I think you're probably thinking of right-linear grammars, which generate the regular languages accepted by deterministic finite automata, which have no stack. Those cannot balance arbitrarily deep nesting of parentheses, as they have only finitely many states.
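The pushdown-automaton argument above can be sketched in a few lines of Python (a hypothetical `balanced` checker that illustrates only the stack discipline, not a full grammar): push on `(`, pop on `)`, accept iff the stack ends empty.

```python
def balanced(s):
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)       # push on open
        elif ch == ")":
            if not stack:
                return False       # close with nothing to match
            stack.pop()            # pop on close
    return not stack               # accept iff the stack is empty

print(balanced("(()(()))"))  # → True
print(balanced("(()"))       # → False
```

A finite automaton has no such stack, which is exactly why it cannot track unbounded nesting depth.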


Re: parsing

S-expressions /are/ trivial to parse. That's the whole point.


How often do you need to parse data?

Seeing as I'm planning to spend my weekend writing a couple of parsers for DATA, not programming languages, I'd like to point out that teaching parsing as an integral part of CS is done for a reason:

Almost all of us will invariably find that a rather large part of our work ends up being related to recognising input in various forms. It's one of the areas where people most often screw up. BADLY. With the result that we're left with brittle, insecure software.

Even a simple thing such as parsing basic arithmetic expressions is usually something most CS students would have been unable to do prior to taking a course on parsing.
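That arithmetic-expression exercise can be sketched as a short recursive-descent evaluator in Python (a hypothetical `evaluate` function over the classic grammar, assuming integer arithmetic; many variants are possible):

```python
import re

def evaluate(src):
    tokens = re.findall(r"\d+|[-+*/()]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def next_token():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():          # factor := NUMBER | "(" expr ")"
        tok = next_token()
        if tok == "(":
            value = expr()
            next_token()   # consume ")"
            return value
        return int(tok)

    def term():            # term := factor (("*" | "/") factor)*
        value = factor()
        while peek() in ("*", "/"):
            if next_token() == "*":
                value *= factor()
            else:
                value //= factor()  # integer division, by assumption
        return value

    def expr():            # expr := term (("+" | "-") term)*
        value = term()
        while peek() in ("+", "-"):
            if next_token() == "+":
                value += term()
            else:
                value -= term()
        return value

    return expr()

print(evaluate("2+3*(4-1)"))  # → 11
```

One grammar rule per function, with precedence encoded in which rule calls which: this is the standard exercise the comment has in mind.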

It is also one of the things that helps people understand how the languages most of them will work with in the real world treat their input. Because face it, most people will not work with (nor WANT TO work with) LISP.

Personally, I've designed a couple of languages, and yes, syntax is a large part of it. Why? Because for most people the syntax matters greatly in how productive they are and directly affects how they manage to relate to the language.

Syntax is the main reason I dislike LISP - my mind is highly visual, and the visual mess (to me) that is the typical LISP program is directly painful to deal with. A program that to me is visually pleasing (though not in the Obfuscated C contest kind of way :) ) will greatly aid me in my understanding of how it works through brief glances, where a badly formatted program will lead me to have to read it in detail.

Vidar Hokstad

Re: How often do you need to parse data?

Check out this newsgroup posting from a couple of years ago (I didn't write it).

I find lisp code very visual, especially with the traditional indentation (where every closing paren is in a column more "to the right" than its opening paren).


Re: How often do you need to parse data?

(I too think that lisp syntax is beautiful)
(and (avant-garde (imagine that syntax in 1970)).)


Re: How often do you need to parse data?

Syntax is the main reason I dislike LISP - my mind is highly visual, and the visual mess (to me) that is the typical LISP program is directly painful to deal with.

This is precisely why I dislike everything except Lisp and Scheme.

Indented s-expressions are, to me, elegant and very easy to read. Any civilized editor can be easily configured to indent s-expressions properly, and to visually match parens. Even RMSMACS will do it ;^)


Is syntax evil?

Yes, Lisp is extensible, easily, via user metaprogramming. Very extensible. But as you say, at the cost of having no syntax. This is the point; to me all the other philosophical stuff is a bit stretched and extremist. The point is that Lisp is extensible because, having no syntax, you can add things to it and they will look like standard language functions. Well, there's more to it than that: it has no syntax (a minimal syntax) in a smart way that lets you write powerful macros too. But is having no syntax a good thing? It depends; it's of course a matter of taste, but syntax is not something evil, and parsing thus is not evil either... We use syntax-rich languages because we (erhm, actually language designers) think that this syntax, crafted for some programming paradigm that fits a particular domain, could be more productive in that domain. Domain-specific languages are not evil; there's no one truth and no single paradigm. Even having no "builtin" paradigms and no syntax, as in Lisp, is not the one truth. Sometimes you need them; for example, I would prefer Mathematica or, even better, Maple syntax in a CAS, instead of Lisp. I like Ruby, and REBOL is a really nice internet-oriented little language. Lisp is a language for exploration, experimentation and hacking, because it's shapeless and it can be modelled easily. But it's not the only way to go. - Angelo


syntax versus semantics

There is no semantics without syntax. Only a load of mumbo-jumbo.
If you are a CS student and cannot understand formal grammars and parse trees, then you had better not enter into any philosophical debate or any complicated discussion about any difficult topic.
At best the use of your own language is imprecise. At worst it is totally flawed and nonsensical!

Re: syntax versus semantics

Yes. Which is precisely why a framework that puts the abstract tree at the center (and allows you to build parsers later) is much better than a framework that puts the character stream first (and hides the abstract syntax tree under a load of ugly ad-hoc algorithms, to be consumed immediately by another load of ugly ad-hoc algorithms).
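The AST-first approach endorsed here can be sketched in Python (hypothetical node classes; the point is that the abstract tree and its semantics exist and are usable before any concrete syntax or parser is written):

```python
from dataclasses import dataclass

# Abstract syntax first: the tree shapes are the real language definition.
@dataclass
class Num:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Mul:
    left: object
    right: object

def evaluate(node):
    """Semantics defined directly on the abstract tree."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Add):
        return evaluate(node.left) + evaluate(node.right)
    if isinstance(node, Mul):
        return evaluate(node.left) * evaluate(node.right)
    raise TypeError(node)

# (2 + 3) * 4, built with no parser at all
tree = Mul(Add(Num(2), Num(3)), Num(4))
print(evaluate(tree))  # → 20
```

A character-level parser producing these nodes can be bolted on later without touching the semantics, which is exactly the ordering the comment advocates.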