Sunday, January 31, 2010

Midterms are graded

... and will be handed back at the end of class on Monday.

Scores overall are pretty low, and what I learned from grading them is that we really aren't done with LR parsing  (LR(0) and especially LALR(1)) yet.  Some of you are close to mastering it, some are pretty lost, and many are somewhere in between. 

Why do I even care?  After all, we've got nice programs like bison and CUP to do this for us.  But I do care, because LR parsing is an exemplar of a couple of important algorithmic patterns:
  • Treating a back-tracking problem (searching for a parse tree) as a non-deterministic search (sending angels out to scout different possible parses) and then, because the number of branches we need to explore is finite, folding them together into a deterministic algorithm (just like the subset construction from NFA to DFA).  This is a powerful pattern that has other applications. 
  • Applying a set of rules (adding new items (productions) in a CFSM state) until none of the rules changes anything.  We'll see this again in data flow analysis, and it appears in other domains as well.  Often we can look at a problem this way or in other ways, but this way is simpler and clearer.  For example, although most textbooks describe computation of "first" and "follow" sets in other ways, the simplest description is in terms of a set of recursive set inequalities that can be applied over and over until nothing changes.  (A sketch of exactly that computation appears right after this list.)
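
To make "apply until nothing changes" concrete, here is a rough sketch of the First-set computation in that style.  This is my own illustration in Java with made-up names, not code from our compiler; productions are stored as arrays {lhs, rhs symbols...}, with an epsilon production being just {lhs}:

    import java.util.*;

    class FirstSets {
        static final String EPS = "epsilon";

        static Map<String, Set<String>> first(List<String[]> prods, Set<String> nonterms) {
            Map<String, Set<String>> first = new HashMap<>();
            for (String n : nonterms) first.put(n, new HashSet<String>());
            boolean changed = true;
            while (changed) {                                // iterate to a fixed point
                changed = false;
                for (String[] p : prods) {
                    Set<String> f = first.get(p[0]);
                    boolean allNullable = true;              // can the rhs seen so far vanish?
                    for (int i = 1; i < p.length && allNullable; i++) {
                        if (!nonterms.contains(p[i])) {      // terminal: it starts the rhs
                            changed |= f.add(p[i]);
                            allNullable = false;
                        } else {                             // nonterminal: pour in its First set
                            Set<String> g = first.get(p[i]);
                            for (String t : g)
                                if (!t.equals(EPS)) changed |= f.add(t);
                            allNullable = g.contains(EPS);
                        }
                    }
                    if (allNullable) changed |= f.add(EPS);  // the whole rhs can be empty
                }
            }
            return first;
        }
    }

Termination is guaranteed because the sets only ever grow and are bounded by the terminals plus epsilon, which is the same argument we'll use for data flow analysis.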
Could you grok this just as well without mastering the details of LALR(1) parse table construction?  Maybe ... but I think it helps.  As an analogy, consider that you studied and were expected to really understand red-black trees in data structures, even though there are perfectly good implementations of red-black trees available to you in libraries.  But red-black trees illustrate important concepts about maintaining balance within an acceptable bound and rebalancing with local operations.  You'll forget the details, but the concepts stick partly because you mastered those details once.  You'll forget the details of LALR(1) parsing approximately Wednesday of spring break (and so will I, and then relearn it once again next time I teach it), but the important concepts will stick better if you force yourself to really, really understand it inside and out for at least a little while.

Friday, January 29, 2010

Midterm & key, and a couple notes

Here is the midterm you took today.

Here is a key to the midterm you took today ... although I haven't very carefully proofread it yet.  Please let me know if you find an error before I find and correct it. 

Looking through the completed exams, it looks to me like we will need to review parsing toward the end of the term.  You've got the basic idea, but most of you had some trouble grinding out the CFSMs.  Two common problems were not completing a state (adding productions for the non-terminals with dots immediately before them) and making transitions to the wrong states (rather than to the state in which the dot has just moved forward).  Average scores are probably going to be pretty low, but don't panic ... we just need to work on this a little more.
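
Here is the shape of the state-completion step, in the same keep-adding-until-nothing-changes style as the other fixed-point sketches on this page.  The Item, Grammar, and Production types are hypothetical (this is not handout code), and it assumes java.util.* is imported:

    // Complete an LR(0) state: for every nonterminal X just after a dot,
    // add the items X ::= . gamma for all of X's productions, repeatedly.
    Set<Item> closure(Set<Item> items, Grammar g) {
        Set<Item> state = new HashSet<>(items);
        Deque<Item> work = new ArrayDeque<>(items);
        while (!work.isEmpty()) {
            Item it = work.pop();
            String x = it.symbolAfterDot();           // null if the dot is at the end
            if (x != null && g.isNonterminal(x))
                for (Production p : g.productionsFor(x)) {
                    Item fresh = new Item(p, 0);      // dot at the far left
                    if (state.add(fresh)) work.push(fresh);
                }
        }
        return state;
    }

If your by-hand states are missing items, it's usually because this loop wasn't run to completion: a newly added item can itself put a fresh nonterminal after a dot.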

Tuesday, January 26, 2010

Wed. 2010.01.27 lecture plan

How to say something constructive about abstract syntax trees?  You know how to build tree data structures; all the choices you face in the next phase depend on how those trees will be used. So, what I plan to do is start looking at type checking.   You might want to bring along a copy of the Cool manual, because what I'd like to do is sketch a few trees for small chunks of Cool code, and then look at how we read and interpret the Cool type-checking rules with respect to those trees.
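
As a preview, the rules in the manual look roughly like this one for arithmetic (my paraphrase from memory ... check the manual for the real thing):

    O, M, C |- e1 : Int
    O, M, C |- e2 : Int
    ------------------------
    O, M, C |- e1 + e2 : Int

Reading a rule against a concrete tree means matching the conclusion to a node and then recursively checking the premises against its children, in the environments O (objects), M (methods), and C (the current class).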

If there are any last-minute questions about the midterm, that's also fair game. 

Friday, January 22, 2010

Old midterms

The questions we did on the board today come from 2006. 
You might want to start with the version without the key, and only then look at the key.

There are a few older midterms available too ... but no keys for them.
I haven't reviewed these older midterms and am not sure how closely they match what we've been doing this term.  I used to spend more time on LL parsing, which I passed over quite quickly this term so that we could start building the parser sooner.

By the way, I was asked whether you should know how to construct First and Follow sets.  Yes, please do.  Note that First is used in LALR(1) parsing as well as LL parsing.  Follow is not used in LALR(1), but the lookahead sets we calculate are closely related, and you should be able to convince yourself that the lookahead set for an LALR(1) item is always a subset of the corresponding Follow set.
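
For those who like to see it in code: here is the Follow computation in the same fixed-point style as the First sketch earlier on this page.  Again the names are made up, it assumes java.util.* is imported, and it assumes the First sets (including epsilon membership) have already been computed, with an entry for every nonterminal:

    static Map<String, Set<String>> follow(List<String[]> prods, Set<String> nonterms,
                                           Map<String, Set<String>> first, String start) {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String n : nonterms) follow.put(n, new HashSet<String>());
        follow.get(start).add("$");                   // end-of-file follows the start symbol
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] p : prods) {
                // Walk the rhs right to left, tracking what can follow each symbol.
                Set<String> trailer = new HashSet<>(follow.get(p[0]));
                for (int i = p.length - 1; i >= 1; i--) {
                    String s = p[i];
                    if (nonterms.contains(s)) {
                        Set<String> f = follow.get(s);
                        int before = f.size();
                        f.addAll(trailer);
                        changed |= f.size() != before;
                        Set<String> fs = first.get(s);
                        if (!fs.contains("epsilon")) trailer.clear();  // not nullable: reset
                        trailer.addAll(fs);
                        trailer.remove("epsilon");
                    } else {
                        trailer.clear();              // a terminal is its own trailer
                        trailer.add(s);
                    }
                }
            }
        }
        return follow;
    }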

Another sample midterm question

I already mentioned this in lecture, and you may remember it ... but it's worth thinking through.  This question was required for students in CIS 561 and optional extra credit for students in CIS 461.

L is the set of all sequences of one or more decimal digits in which each digit appears at most once.  For example, "0152" is in L, and "520" is in L, but "9525" is not in L  (because 5 is repeated).

Is L a regular language?  If it is not regular, why not? If L is regular, how many states are in the smallest DFA that recognizes L?

Sample grammar question from a previous midterm

Here's a question I asked on the midterm in 2006.  It's probably a pretty good indication of the sort of thing you should be able to do on the midterm this year. 

Consider the following grammar. (The terminal symbols are $ (end-of-file), i, "+", and ";".  "epsilon" is used to mark an empty right-hand side.)

Pgm ::= Block $
Block ::= Block E Sep
Block ::= E Sep
E ::= E "+" i
E ::= i
Sep ::= ";"
Sep ::= epsilon

Part a (5 points): Which of the following strings are accepted by the grammar?
i + i ; i + i
i i i ; i i
; i ; i
i + + i ;
i + i + i ;

Part b (15 points): Show that the grammar is not LR(0), but that the LR(0) conflict is resolved using LALR(1) lookahead.  (Just show enough states to show how one conflict in the LR(0) machine is resolved in the LALR(1) machine.)

Thursday, January 14, 2010

Parser starter kit

Next assignment is to build a parser for Cool.   A starter kit (parts of the parser, plus a driver to call it) is at http://www.cs.uoregon.edu/classes/10W/cis461/handouts/parserKit.zip

A few notes on this:
- While in theory we got all the keywords and punctuation in the scanner, as I was finishing up my version of this I discovered I had missed a few ... so that's something to check when parsing fails.
- The driver accepts -d to mean "parse in debug mode", which is quite noisy but helps you see what tokens the parser is receiving and what it is doing with them. Unfortunately it prints them as numbers, so you need to look in sym.java to make sense of it.  (This is pretty stupid ... CUP could easily have translated those integer codes into their symbolic names, but it doesn't.  A tiny workaround is sketched just after this list.)
- I've filled in a lot, but left you a few things to figure out ... including how to make the comparison operations <, =, <= non-associative.  Note that you could use precedence declarations in CUP rather than factoring the grammar as far as I have to get precedence ... feel free to do it that way if you prefer (see the CUP sketch at the end of this post).
- In the test files, watch out for things that are in Aiken's version of Cool and not our version --- e.g., one of the tests has loop ... pool instead of do ... od.
- While my version of this (before I chopped stuff out for  you to fill back in) worked on one example, I do not guarantee it is correct.  You should write several small programs to test it.
- Small is beautiful, especially when you are debugging.  If your parser fails on a program of 30 lines, the first thing you should do is cut it down to a program of 5-10 lines with the same failure.
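
About those numeric token codes: the workaround I had in mind is something like this tiny helper (an untested sketch; it relies only on the fact that the generated sym class is a bag of public static final int constants):

    import java.lang.reflect.Field;

    class TokenNames {
        static String name(int code) {
            for (Field f : sym.class.getFields())
                try {
                    if (f.getType() == int.class && f.getInt(null) == code)
                        return f.getName();           // e.g. "ID" instead of 7
                } catch (IllegalAccessException e) {
                    // skip anything we can't read
                }
            return "token#" + code;                   // no constant matched
        }
    }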

When you start getting "shift/reduce conflict" and "reduce/reduce conflict" error messages from CUP, you will need to understand the LR parsing theory that we will start talking about tomorrow (Friday the 15th).
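
Precedence declarations, by the way, are also how you can silence many of those conflicts without factoring the grammar.  Here is the general shape (a sketch, not our actual grammar, and your token names may differ; as in yacc, later declarations bind tighter):

    precedence nonassoc LT, LE, EQ;    /* lowest: comparisons don't chain */
    precedence left PLUS, MINUS;
    precedence left TIMES, DIVIDE;     /* highest of these three */

Declaring the comparisons nonassoc is what makes something like a < b < c a syntax error rather than a quietly left- or right-associated parse.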

Wednesday, January 13, 2010

Study session 10-11 Thursday?

Aran Clauson has proposed to host a study session Thursday during the same hour as class (10-11) in room 260.  Sounds like a good idea to me.  I will also try to be available during that hour.

Saturday, January 9, 2010

The ID (identifier) token and what it represents

I was asked whether the scanner needs to recognize bad identifiers like foo.bar.baz .  The short answer is "no, that's a job for the parser", but the question indicates a need to be a little clearer about exactly what the identifier (ID) token represents.

foo.bar.baz is 5 tokens:  ID foo, DOT, ID bar, DOT, ID baz.  A scanner should in general not try to match anything with internal structure, like foo.bar(); it is returning the atoms of a program, to be assembled into molecules by the parser.    (I know the STRINGLIT token may seem like an exception, but from the parser's point of view it has no internal structure, even if the scanner has to do some work to interpret things like \n and \" inside the quoted string literal.)
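
In JFlex terms, the relevant rules are as small as something like this (a sketch only, with hypothetical token names ... the real Cool patterns differ, since Cool distinguishes type identifiers from object identifiers by the case of the first letter):

    [a-z][A-Za-z0-9_]*    { return new Symbol(sym.ID, yytext()); }
    "."                   { return new Symbol(sym.DOT); }

The parser then assembles ID DOT ID DOT ID into whatever structure the grammar dictates; the scanner never treats the dotted chain as a unit.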

Friday, January 8, 2010

Building compilers: Pedagogic vs real projects

The order in which we're tackling the Cool compiler is: build a lexical analyzer for the whole language, then a parser for the whole language, then a static checker for the whole language, then a code generator for the whole language.  We're forced into that order because it lets us study the different parts of the compiler and their underlying theory one at a time, instead of everything at once.  But, in case it's not obvious, let me just mention that no sane person would build a real compiler this way.

How do you build a real compiler?

First you build a compiler for the teeniest tiniest subset of the language that you can possibly test ... maybe it can compile the "hello world" program and nothing else.  Probably it can't compile the "hello world" program because that would involve string processing and IO ... but it might compile a procedure that can add two integers.  Maybe you make it an interpreter instead of a compiler, and then you add a code generator as a second step.

And then you add some small feature, adding and fixing what you need in every phase of the compiler, from lexical analysis to code generation.

And then another feature.  And so on.  Add, test, revise, add, test, revise, until you have a complete compiler for the whole language.  That approach doesn't work as a class project in a compilers course, but for a real compiler it's much better than what we are doing, whether it's for a full programming language like Java or for some itsy-bitsy language you invent as an interface to some application you are building.

[Thanks Dan for prompting this thought in our after-class discussion.]

Thursday, January 7, 2010

Two more NFA => DFA problems for the board

1.  From the winter 2005 midterm, a language for binary or trinary literals.  A binary literal is
(0|1)+b2
and a trinary literal is
(0|1|2)+b3
So for example 01110101b2 is a binary literal, 0110101b3 is a trinary literal and so is 01002201b3, but 010020101b2 is neither a binary literal nor a trinary literal.   Given an NFA for this, we'll construct the DFA.

2.  From the Spring 2006 midterm, a simplified quoted string problem.  A quoted string (string literal) starts with a quotation mark and ends with a quotation mark.  Within the string, we can have alphabetic characters.  To put a quotation mark within a quoted string, we type two quotation marks together, like this:
"this is a ""string with a quoted substring"" up to here"
If we let 'a' stand for alphabetic characters and Q stand for a quotation mark, the regular expression might be something like
Q(a|(QQ))*Q
We'll use the subset construction to derive a DFA from the NFA.

Regexp => NFA => DFA problem for tomorrow's class

The main things I want to do in class Friday are:
  • work a few examples of translating regular expressions to non-deterministic finite-state acceptors
  • discuss a few practical complications that aren't apparent in the theory, like how the "maximum munch" rule and matching a lot of different patterns (with different actions) go together
  • discuss the limitations of lexical analysis (the pumping lemma, for those of you who took automata theory)
Here's a small example we can do on the board tomorrow, for the first two items above.  Imagine your language has integer constants and floating point constants.  Integer constants are a sequence of one or more decimal digits.  It might be specified as
[0-9]+
or equivalently as
[0-9][0-9]*
Floating point constants consist of zero or more digits, followed by a decimal point, followed by one or more digits.  (So .00 is a floating point constant, but 0. is not, and neither is . )  This might be specified as
[0-9]*[.][0-9][0-9]*

We could carefully follow Thompson's construction to get the NFA for each of these and then combine them, but it's easier and less messy to eyeball the pattern for integer constant and the pattern for floating point constant and create an automaton for each, then use Thompson's construction only to tie them together as
    ( integer constant ) |  (floating point constant )
and then we can go through the subset construction to get a single deterministic finite-state acceptor that looks for both of them and succeeds if it matches either one.  You might want to try that in advance (but you don't have to); there's also a code sketch of the construction after the examples below.  If you do, consider what should happen with the following input:
998
99.foo
.998
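
If you would rather see the subset construction as code than crank it by hand, here is a rough sketch.  The Nfa interface is invented for the example; any representation with a start state, per-character moves, and an epsilon-closure operation will do:

    import java.util.*;

    class SubsetConstruction {
        interface Nfa {
            int start();
            Set<Integer> move(int state, char c);        // states reachable on c
            Set<Integer> closure(Set<Integer> states);   // add epsilon-reachable states
        }

        // Each DFA state is the set of NFA states we might be in.
        static Map<Set<Integer>, Map<Character, Set<Integer>>>
                determinize(Nfa nfa, Set<Character> alphabet) {
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Deque<Set<Integer>> work = new ArrayDeque<>();
            work.push(nfa.closure(Collections.singleton(nfa.start())));
            while (!work.isEmpty()) {
                Set<Integer> state = work.pop();
                if (dfa.containsKey(state)) continue;    // already expanded
                Map<Character, Set<Integer>> out = new HashMap<>();
                for (char c : alphabet) {
                    Set<Integer> next = new HashSet<>();
                    for (int s : state) next.addAll(nfa.move(s, c));
                    next = nfa.closure(next);
                    if (!next.isEmpty()) { out.put(c, next); work.push(next); }
                }
                dfa.put(state, out);
            }
            return dfa;
        }
    }

A DFA state accepts iff its set contains an accepting NFA state; when several token patterns are combined, each accepting state also needs to record which pattern(s) matched, which is exactly what makes an input like 99.foo interesting under maximum munch.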

Tuesday, January 5, 2010

Scanner assignment up

Grab http://www.cs.uoregon.edu/classes/10W/cis461/handouts/Cool.tgz to get started on the scanner (lexical analyzer) assignment.

I have filled in a lot of trivial grungy stuff and the stuff that doesn't really have anything to do with lexical analysis ... I think what's left for you to fill in should be pretty quick.  Keyword and punctuation patterns are certainly trivial ... you'll see the pattern in Cool.jflex and can just fill in the few I removed.  Two patterns might take a bit more work: comments and strings.  There is a pattern for comments there already, and it *might* be right, but I don't promise.  There is a pattern for string literals, and I'd be really surprised if it is correct.  It's really ugly, too ... like a lot of complex regular expressions.  Consider using the lexical state facility of JFlex to break it down into simpler patterns that you can read, understand, and fix.
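
If you take the lexical-state route, the shape is roughly this (a sketch only, not the assignment's actual rules ... Cool's escapes and error cases need more care):

    %state STRING
    %{
      StringBuilder sb = new StringBuilder();
    %}
    %%
    <YYINITIAL> \"        { sb.setLength(0); yybegin(STRING); }
    <STRING> {
      \"                  { yybegin(YYINITIAL);
                            return new Symbol(sym.STRINGLIT, sb.toString()); }
      \\n                 { sb.append('\n'); }    /* the two characters \ and n */
      \\\"                { sb.append('"'); }     /* an escaped quotation mark */
      [^\"\\\n]+          { sb.append(yytext()); }
    }

Each piece is now small enough to read and fix on its own, which is the whole point.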

Java will complain that the generated scanner uses unsafe operations.  That's because the scanner skeleton file is written in the old, pre-generics style.   I'm using one of the optional skeletons because I threw in the ability to #include just for fun and to demo how that works ... it's not part of Cool, but it's something you'll want to do in another scanner some day. 

Due: Monday of next week.  The parser assignment will come out Monday or Wednesday.

The README.txt file says a bit more about what you'll find and how to get going.  Although the amount of work here should not be large, it's a good idea to tackle it soon, so that you have a chance to ask questions when things break.

Sunday, January 3, 2010

Schedule: Two midterms and no final?

I will attend a workshop in Germany during finals week.  In principle I could ask someone to proctor a regular final exam and scan and email the exams to me, and I could grade them from Germany ... but it would be a rush job, and too much could go wrong (e.g., some huge blunder in an exam question and me not there to explain what I meant).  So ... what to do?

The best I can think of is two midterms instead of a midterm and a final --- and then put a bit of space between the second midterm and the final project due date, so they don't interfere too much.  I'm thinking final project due Monday of dead week, second midterm on Wednesday or Friday.  Comments?  Better ideas?

Revised Cool language manual posted

I've revised the Cool language manual for this year's class.  Most of the changes are really small and cosmetic, e.g., making the assignment symbol be ":=" instead of "<-"  ("=" is a comparison in Cool).  I added section 13 at the end with notes on differences between Cool and Java (and to some extent C++).  Those aren't essential but might be helpful.

The link from the course web site should work, or follow it here:
http://www.cs.uoregon.edu/classes/10W/cis461/handouts/cool-manual.pdf