I was asked whether the scanner needs to recognize bad identifiers like foo.bar.baz. The short answer is "no, that's a job for the parser", but the question suggests I need to be a little clearer about exactly what the identifier (ID) token represents.
foo.bar.baz is 5 tokens: ID foo, DOT, ID bar, DOT, ID baz. A scanner should in general not try to match anything with internal structure, like foo.bar(); it is returning the atoms of a program, to be assembled into molecules by the parser. (I know the STRINGLIT token may seem like an exception, but from the parser's point of view it has no internal structure, even if the scanner has to do some work to interpret things like \n and \" inside the quoted string literal.)
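To make the five-token breakdown concrete, here is a minimal sketch of a scanner in Python. It is not the course's scanner: the LPAREN, RPAREN, and WS token names, the `scan` function, and the simplified identifier pattern are all assumptions made for illustration. Only ID and DOT come from the discussion above.

```python
import re

# Token names other than ID and DOT are hypothetical, added only so that
# inputs like foo.bar() can be scanned. The ID pattern is a simplification.
TOKEN_SPEC = [
    ("ID", r"[A-Za-z][A-Za-z0-9_]*"),
    ("DOT", r"\."),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(text):
    """Return a flat list of (token_name, lexeme) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(scan("foo.bar.baz"))
```

The scanner never looks at how the tokens fit together. Deciding whether ID DOT ID DOT ID forms a legal expression is left entirely to the parser.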
What about identifiers that begin with an underscore? The language manual is unclear. First it says, "Identifiers are strings (other than keywords) consisting of letters, digits, and the underscore character." It then says, "type identifiers begin with a capital letter; object identifiers begin with a lower case letter." Are leading underscores disallowed, as in Ada?
Object identifiers begin with a lower case letter, and type identifiers begin with an upper case letter. Those are the only two kinds of identifier, so nothing can start with an underscore. It's still perfectly true that identifiers "are strings ... consisting of letters, digits, and the underscore character." Note that it's also true in Cool (and in most programming languages I know) that an identifier cannot begin with a digit, although digits may be used freely within identifiers.
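As a rough illustration, the two identifier rules can be written as regular expressions. This is a sketch under assumptions: the names TYPEID, OBJECTID, and `classify` are hypothetical, the patterns accept ASCII letters only, and keywords are not filtered out here even though the manual excludes them.

```python
import re

# Hypothetical names. Both kinds allow letters, digits, and underscores
# after the first character; only the first character differs.
TYPEID = re.compile(r"[A-Z][A-Za-z0-9_]*")
OBJECTID = re.compile(r"[a-z][A-Za-z0-9_]*")


def classify(lexeme):
    """Return 'TYPEID', 'OBJECTID', or None if the lexeme is neither."""
    if TYPEID.fullmatch(lexeme):
        return "TYPEID"
    if OBJECTID.fullmatch(lexeme):
        return "OBJECTID"
    return None


for lexeme in ["Main", "foo_bar2", "_foo", "9lives"]:
    print(lexeme, classify(lexeme))
```

Because neither pattern lets an underscore or a digit appear in the first position, `_foo` and `9lives` match neither kind of identifier.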