UNIT-I Part 2 Describing Syntax and Semantics
2
Introduction
Who must use language definitions?
Language designers
Implementers
Programmers (the users of the language)
Syntax
The form or structure of the expressions, statements, and
program units
Defines what is grammatically correct
Semantics
The meaning of the expressions, statements, and program units
Describing syntax is easier than describing semantics
3
Some definitions
A sentence is a valid string of characters over some
alphabet
A language is a set of sentences
The syntax rules of the language specify which strings of
characters are valid sentences
A lexeme is the lowest level syntactic unit of a language
For example: sum, +, 1234
A token is a category of lexemes
For example: identifier, plus_op, int_literal
Each token may be described by separate syntax rules
Thus we may think of sentences as strings of lexemes
rather than as strings of characters
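The lexeme/token distinction above can be sketched with a small scanner. This is only an illustration: the token names follow the examples on the slide, while the regular expressions, the `skip` category, and the function name `tokenize` are assumptions made for the example.

```python
import re

# Hypothetical token categories, named after the examples on the slide.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("plus_op", r"\+"),
    ("skip", r"\s+"),          # whitespace is matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input string."""
    pairs = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            pairs.append((match.lastgroup, match.group()))
        pos = match.end()
    return pairs
```

Calling `tokenize("sum + 1234")` pairs each lexeme with its category: `sum` with `identifier`, `+` with `plus_op`, and `1234` with `int_literal`.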
4
Describing syntax
Syntax may be formally described using
recognition or generation
Recognition involves a recognition device R
Given an input string, R either accepts the string as
valid or rejects it
R is only used in trial-and-error mode
A recognizer is not effective in enumerating all
sentences in a language
Languages are usually infinite
The syntax analyzer part of a compiler (parser) is a
recognizer
5
Describing syntax
Generation
A language generator generates the sentences of a
language
A grammar is a language generator
One can determine if a string is a sentence by
comparing it with the structure given by a generator
6
Formal methods for describing syntax
Noam Chomsky and John Backus independently
developed similar formalisms in the 1950s
In the mid-1950s, Chomsky identified four classes of
grammars for studying linguistics
Regular grammars
Recognizer – Deterministic Finite Automaton (DFA)
Context-free grammars
Recognizer – Push-down automaton
Context-sensitive grammars
Recognizer – Linear-bounded automaton
Phrase structure grammars
Recognizer – Turing machine
The first is useful for describing tokens
Most programming languages can be (mostly) described
by the second
7
Formal methods for describing syntax
Context-Free Grammar (CFG)
A language generator
Not powerful enough to describe syntax of natural
languages
Defines a class of languages called
context-free languages
Backus-Naur Form (BNF)
Presented in 1959 by John Backus to describe Algol 58
Notation was slightly improved by Peter Naur
BNF is equivalent to Chomsky’s context-free grammars
8
Formal methods for describing syntax
A meta-language is a language used to describe another
language
BNF is a meta-language for programming languages
In BNF . . .
A terminal symbol is used to represent a lexeme or a token
A nonterminal symbol is used to represent a syntactic class
Examples: assignment statement, while loop, Boolean expression
A production rule defines one nonterminal symbol in terms of
terminal symbols and/or other nonterminal symbols
9
Production rule example
The following production rule defines the syntactic
class of a while statement
<while_stmt> → while ( <logic_expr> ) <stmt>
The syntactic class being defined is on the left-hand
side of the arrow (LHS)
The text on the right-hand side (RHS) gives the
definition of the LHS
The RHS above consists of 3 terminals (tokens)
and 2 nonterminals (syntactic classes)
Terminals: while, (, and )
Nonterminals: <logic_expr> and <stmt>
10
Formal methods for describing syntax
Nonterminal symbols may have multiple distinct
definitions, as in . . .
<if_stmt> → if <logic_expr> then <stmt>
<if_stmt> → if <logic_expr> then <stmt> else <stmt>
Alternative form
<if_stmt> → if <logic_expr> then <stmt>
          | if <logic_expr> then <stmt> else <stmt>
More compactly, . . .
<if_stmt> → if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt>
The vertical bar ‘|’ is read “or”
11
Formal methods for describing syntax
The nonterminal symbol being defined on the LHS
may appear on the RHS
Such a production rule is recursive
Example
Lists can be described using recursion
<identifier_list> → identifier
                  | identifier , <identifier_list>
12
Formal methods for describing syntax
A grammar G = ( T, N, P, S ), where
T is a finite set of terminal symbols
N is a finite set of nonterminal symbols
P is a finite nonempty set of production rules
S is a start symbol representing a complete sentence
The start symbol is typically named <program>
Generation of a sentence is called a derivation
Beginning with the start symbol, a derivation applies
production rules repeatedly until a complete sentence
is generated
A complete sentence consists of only terminal symbols
13
Formal methods for describing syntax
An example grammar
<program> → <stmts>
<stmts> → <stmt> | <stmt> ; <stmts>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | integer
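A derivation with this grammar can be traced mechanically. The sketch below encodes the six rules as a Python dictionary and always expands the leftmost nonterminal; the function name `leftmost_derivation` and the idea of passing a list of alternative indices are assumptions made for the illustration.

```python
# The example grammar: each nonterminal maps to its list of alternatives.
GRAMMAR = {
    "<program>": [["<stmts>"]],
    "<stmts>": [["<stmt>"], ["<stmt>", ";", "<stmts>"]],
    "<stmt>": [["<var>", "=", "<expr>"]],
    "<var>": [["a"], ["b"], ["c"], ["d"]],
    "<expr>": [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
    "<term>": [["<var>"], ["integer"]],
}


def leftmost_derivation(choices, start="<program>"):
    """Apply one alternative per step to the leftmost nonterminal.

    `choices` gives, for each step, the index of the alternative to use.
    Returns every sentential form produced, starting with the start symbol.
    """
    form = [start]
    forms = [" ".join(form)]
    for choice in choices:
        # Locate the leftmost nonterminal in the current sentential form.
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        form = form[:index] + GRAMMAR[form[index]][choice] + form[index + 1:]
        forms.append(" ".join(form))
    return forms
```

With `choices = [0, 0, 0, 0, 0, 0, 1, 0, 2]` the derivation starts at `<program>` and ends with the sentence `a = b + c`; every intermediate string is a sentential form. The sketch does no error handling: supplying more choices than there are nonterminals left raises `StopIteration`.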
15
Derivation
Every string of symbols in the derivation is called a
sentential form
including <program>
A sentence is a sentential form that has only
terminal symbols
A leftmost derivation is one in which the leftmost
nonterminal in each sentential form is the one that
is expanded next
A derivation may be leftmost or rightmost or neither
Derivation order has no effect on the language
generated by a grammar
16
Parse Tree
A parse tree is a hierarchical representation of a derivation
Each internal node is labeled with a nonterminal symbol
and each leaf is labeled with a terminal symbol
[Figure: parse tree rooted at <program>, descending through <stmts> and <stmt> to <var> = <expr>]
[Figure: the derivation step <expr> => <expr> + <term> shown with its subtree]
20
Associativity of operators
Operator associativity can also be indicated by a
grammar
<expr> → <expr> + <expr> | int (ambiguous)
<expr> → <expr> + int | int (unambiguous)
Example: a parse tree using the unambiguous grammar
The unambiguous grammar is left recursive and produces
a parse tree in which the order of addition is left associative
Addition is performed in a left-to-right manner
[Figure: parse tree in which <expr> expands to <expr> + int, the inner <expr> expands to <expr> + int, and the innermost <expr> expands to int]
21
“Dangling-else” problem
Consider the grammar
<stmt> → • • • | <if_stmt> | • • •
<if_stmt> → if <logic_expr> then <stmt>
          | if <logic_expr> then <stmt> else <stmt>
23
Extended BNF (denoted EBNF)
Three abbreviations are added for convenience
Optional parts on the RHS of a production rule can be
placed in brackets
[ <optional> ]
Braces on the RHS indicate that the enclosed part may
be repeated 0 or more times
{ <repeated> }
When a single element must be chosen from a group, the
options are placed in parentheses and separated by
vertical bars
(a|b|c)
24
Extended BNF examples
Brackets
<proc_call> → ident [ ( <expr_list> ) ]
Generates: myProcedure and myProcedure( a, b, c )
Languages like Ada and Pascal do not use ( ) when a procedure
has no parameters
Braces
<identifier_list> → ident { , ident }
Generates: Larry, Curly, Moe
Choice among options
<term> → int ( + | - ) int
Generates: 5 + 7 and 5 - 7
25
BNF and EBNF example
BNF:
<expr> → <expr> + <term>
       | <expr> - <term>
       | <term>
<term> → <term> * <factor>
       | <term> / <factor>
       | <factor>
EBNF:
<expr> → <term> { ( + | - ) <term> }
<term> → <factor> { ( * | / ) <factor> }
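The EBNF form maps almost directly onto a recursive-descent routine: each pair of braces becomes a loop. The following is a minimal sketch under two assumptions that are not in the grammar above: `<factor>` is taken to be an integer literal, and `/` is treated as integer division. The name `evaluate` is hypothetical, and characters that are neither digits nor operators are silently skipped by the simple scanner.

```python
import re


def evaluate(text):
    """Evaluate text using <expr> -> <term> {(+|-) <term>} and
    <term> -> <factor> {(*|/) <factor>}, with <factor> assumed to be an integer."""
    tokens = re.findall(r"\d+|[-+*/]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token = peek()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected an integer, found {token!r}")
        return int(advance())

    def term():
        value = factor()
        while peek() in ("*", "/"):          # { ( * | / ) <factor> }
            if advance() == "*":
                value = value * factor()
            else:
                value = value // factor()    # assumption: integer division
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):          # { ( + | - ) <term> }
            if advance() == "+":
                value = value + term()
            else:
                value = value - term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

Because each loop folds new operands into the value accumulated so far, `10 - 4 - 3` is computed as `(10 - 4) - 3`, and `*` binds tighter than `+` because `<term>` sits below `<expr>` in the grammar.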
26
Extended BNF
EBNF uses metasymbols |, {, }, (, ), [, and ]
When metasymbols are also terminal symbols in
the language being defined, instances that are
terminal symbols must be quoted
<proc_call> → ident [ ‘(’ <expr_list> ‘)’ ]
When regular BNF indicates that an operator is left
associative, the corresponding EBNF does not
Example
BNF: <sum> → <sum> + int
EBNF: <sum> → int { + int }
This must be overcome during syntax analysis
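One common way a parser restores left associativity from the EBNF rule is to fold each new operand into the tree built so far. The sketch below is an illustration only; the function name `parse_sum` and the tuple representation of tree nodes are assumptions.

```python
def parse_sum(tokens):
    """Parse <sum> -> int { + int } and return a left-nested tree of tuples."""
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected an int")
    tree = tokens[0]
    pos = 1
    while pos < len(tokens):
        if tokens[pos] != "+" or pos + 1 >= len(tokens) or not tokens[pos + 1].isdigit():
            raise SyntaxError(f"malformed input near position {pos}")
        # The tree built so far becomes the LEFT operand of the new node,
        # which gives the same shape as the left-recursive BNF rule.
        tree = (tree, "+", tokens[pos + 1])
        pos += 2
    return tree
```

For the token list `["1", "+", "2", "+", "3"]` the result is `(("1", "+", "2"), "+", "3")`: the first addition is nested deepest, so it is performed first.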
27
Extended BNF
Sometimes a superscript + is used as an additional
metasymbol to indicate one or more repetitions
Example: The production rules
<compound_stmt> → begin <stmt> { <stmt> } end
and
<compound_stmt> → begin { <stmt> }+ end
are equivalent
28
Attribute grammars
Context-free grammars (CFGs) cannot describe all
of the syntax of programming languages
Typical example
a variable must be declared before it can be referenced
Something like this is called a “context-sensitive
constraint”
Text refers to it as “static semantics”
29
Attribute grammars
Static semantics refers to the legal form of a
program
This is actually syntax rather than semantics
The term “semantics” is used because the
check is done during semantic analysis rather than during
parsing
The term “static” is used because the analysis required
to check the constraint can be done at compile time
30
Attribute grammars (AGs)
An attribute grammar is an extension to a CFG
Concept developed by Donald Knuth in 1968
The additional AG features describe static semantics
These features carry some semantic info along through
parse trees
Additional features
Attributes
Can be assigned values like variables
Attribute computation functions
Specify how attribute values are calculated
Predicate functions
Do the checking
31
Attribute grammars defined
Definition: An attribute grammar is a context-free
grammar G = (T, N, P, S) with the following additions:
For each grammar symbol X there is a set A(X) of
attributes
Some of these are synthesized
• These pass information up the parse tree
The remaining attributes are inherited
• These pass information down the parse tree
Each production rule has a set of attribute computation
functions that define certain attributes for the nonterminals
in the rule
Each production rule has a (possibly empty) set of
predicate functions to check for attribute consistency
32
Attribute grammars defined
Let X0 → X1 ... Xn be a rule
Synthesized attributes are computed with functions
of the form
S(X0) = f(A(X1), ... , A(Xn))
S(X0) depends only on X0's child nodes
Inherited attributes for symbols Xj on the RHS are
computed with functions of the form
I(Xj) = f(A(X0), ... , A(Xn))
I(Xj) depends on Xj's parent as well as its siblings
33
Attribute grammars defined
Initially, there are synthesized intrinsic attributes
on the leaves
When all attributes of a parse tree have been
computed, the parse tree is fully attributed
Predicate functions for X0 → X1 ... Xn are Boolean
functions defined over the attribute set
{A(X0), ... , A(Xn)}
For a program to be correct, every predicate
function for every production rule must be true
Any false predicate function value indicates a
violation of the static semantics of the language
34
Attribute grammar example
Assignments of the form: id = id + id
Design choices
Expression id's can be either int_type or real_type
Types of the two id's (RHS) must be the same
Type of the expression must match its expected type (LHS)
BNF:
<assign> → <var> = <expr>
<expr> → <var> + <var>
<var> → id
Attributes:
actual_type
Synthesized for <var> and <expr>
Intrinsic for id
expected_type
Inherited for <expr> from <var> in <assign> → <var> = <expr>
35
The attribute grammar
Syntax rule: <assign> → <var> = <expr>
Attribute computation function:
<expr>.expected_type ← <var>.actual_type
Syntax rule: <expr> → <var>[1] + <var>[2]
Attribute computation function:
<expr>.actual_type ← <var>[1].actual_type
Predicates:
<var>[1].actual_type =? <var>[2].actual_type
<expr>.expected_type =? <expr>.actual_type
Syntax rule: <var> → id
Attribute computation function:
<var>.actual_type ← lookup (id.type)
36
Attribute grammars
In what order are attribute values computed?
If all attributes were inherited, the tree could be
decorated in top-down order
If all attributes were synthesized, the tree could be
decorated in bottom-up order
In many cases, both kinds of attributes are used, and it
is some combination of top-down and bottom-up that
must be used
Complex problem in general
May require construction of a dependency graph showing all
attribute dependencies
37
Computation of attributes
For the assignment: “total = sum + increment”
<var>.actual_type ← lookup (total.type)
<expr>.expected_type ← <var>.actual_type
<var>[1].actual_type =? <var>[2].actual_type
<expr>.actual_type ← <var>[1].actual_type
<expr>.actual_type =? <expr>.expected_type
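The attribute computations for an assignment of the form id = id + id can be traced in a few lines of Python. This is a sketch, not a general attribute evaluator: the symbol table contents, the function names `lookup` and `check_assignment`, and the dictionary it returns are all assumptions made for the illustration.

```python
# Hypothetical symbol table: maps each identifier to its declared type.
SYMBOL_TABLE = {
    "total": "real_type",
    "sum": "real_type",
    "increment": "real_type",
    "count": "int_type",
}


def lookup(name):
    """Intrinsic attribute of an id leaf: its declared type."""
    return SYMBOL_TABLE[name]


def check_assignment(target, left, right):
    """Decorate the tree for `target = left + right` and evaluate both predicates."""
    # <var>.actual_type <- lookup(id.type), for the variable on the LHS
    target_type = lookup(target)
    # <expr>.expected_type <- <var>.actual_type (inherited, passed down)
    expected_type = target_type
    # <var>[1] and <var>[2] take their actual_type from the id leaves
    left_type, right_type = lookup(left), lookup(right)
    # <expr>.actual_type <- <var>[1].actual_type (synthesized, passed up)
    actual_type = left_type
    return {
        "expected_type": expected_type,
        "actual_type": actual_type,
        "operands_match": left_type == right_type,            # first predicate
        "expr_matches_target": actual_type == expected_type,  # second predicate
    }
```

With the table above, `check_assignment("total", "sum", "increment")` satisfies both predicates, while replacing `increment` by `count` makes the first predicate false, which signals a static semantics violation.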
38
Semantics
The meaning of expressions, statements, and
program units is known as dynamic semantics
We consider three methods of describing dynamic
semantics
Operational semantics
Denotational semantics
Axiomatic semantics
39
Operational semantics
Operational semantics describes the meaning of a
language statement by executing the statement on a
machine, either real or simulated
The meaning of the statement is defined by the
observed change in the state of the machine
i.e., the change in memory, registers, etc.
40
Operational semantics
The best approach is to use an idealized, low-level virtual
computer, implemented as a software simulation
Then, build a translator to translate source code to the
machine code of the idealized computer
The state changes in the virtual machine brought about by
executing the code that results from translating a given
statement define the meaning of the statement
In effect, this describes the meaning of a high-level
language statement in terms of the statements of a
simpler, low-level language
41
Operational semantics example
The C statement for ( expr1; expr2; expr3 ){ • • • }
is equivalent to:
expr1;
loop: if expr2 = 0 goto out
•••
expr3;
goto loop
out: • • •
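Python has no goto, but the translation above can still be followed step by step by keeping the current label in a variable. The sketch below is a simulation written for illustration: the function `run_for`, the extra `init` and `body` labels, and the use of a dictionary as the machine state are assumptions.

```python
def run_for(init, cond, step, body, state):
    """Execute for (init; cond; step) { body } by following the goto translation.

    Each argument is a function that reads or updates the `state` dictionary.
    """
    label = "init"
    while label != "out":
        if label == "init":
            init(state)                                   # expr1;
            label = "loop"
        elif label == "loop":
            # loop: if expr2 = 0 goto out
            label = "out" if cond(state) == 0 else "body"
        elif label == "body":
            body(state)                                   # the loop body
            step(state)                                   # expr3;
            label = "loop"                                # goto loop
    return state
```

The meaning of the loop is then the change it makes to `state`. For example, with an initial state `{"total": 0}`, an `init` that sets `i` to 0, a `cond` that returns `int(state["i"] < 3)`, a `step` that increments `i`, and a `body` that adds `i` to `total`, the final state has `total` equal to 3 and `i` equal to 3.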
44
Denotational semantics
The state s of a program consists of the values of all
its current variables
s = {<i1, v1>, <i2, v2>, …, <in, vn>}
Here, ik is a variable and vk is the associated value
Each vk is a mathematical object
Most semantics mapping functions for program
constructs map states to states
The state change defines the meaning of the
program construct
Expression statements (among others) map states to
values
45
Denotational semantics
Let VARMAP be a function that, when given a
variable name and a state, returns the current
value of the variable
VARMAP(ik, s) = vk
Any variable can have the special value undef
i.e., currently undefined
46
Denotational semantics example
The syntax of decimal numbers is described by the
EBNF grammar
<dec_num> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
          | <dec_num> (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
The denotational semantics of decimal numbers
involves a semantic function that maps decimal
numbers as strings of symbols into numeric values
(mathematical objects)
47
Semantic function for decimal numbers
Mdec('0') = 0, Mdec ('1') = 1, …, Mdec ('9') = 9
Mdec (<dec_num> '0') = 10 * Mdec (<dec_num>)
Mdec (<dec_num> '1') = 10 * Mdec (<dec_num>) + 1
•••
Mdec (<dec_num> '9') = 10 * Mdec (<dec_num>) + 9
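The semantic function Mdec is a recursive definition and can be transcribed nearly line for line. The Python name `m_dec` and the input validation are assumptions added for the sketch.

```python
def m_dec(numeral):
    """Map a string of decimal digits to the integer it denotes."""
    if not numeral or any(ch not in "0123456789" for ch in numeral):
        raise ValueError(f"not a decimal numeral: {numeral!r}")
    if len(numeral) == 1:
        # Base cases: Mdec('0') = 0, ..., Mdec('9') = 9
        return "0123456789".index(numeral)
    # Recursive cases: Mdec(<dec_num> 'd') = 10 * Mdec(<dec_num>) + d
    return 10 * m_dec(numeral[:-1]) + m_dec(numeral[-1])
```

For the string `'407'` the recursion unfolds as 10 * (10 * 4 + 0) + 7, so the symbols are mapped to the mathematical integer 407.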
48
Denotational semantics of expressions
Assume expressions consist of decimal integer
literals, variables, or binary expressions having one
arithmetic operator and two operands, each of
which can only be a variable or integer literal
The value of an expression is typically an integer
The value of an expression is error if it involves an undef value
Thus, expressions map onto Z ∪ {error}
<expr> → <dec_num> | <var> | <binary_expr>
<binary_expr> → <left_expr> <operator> <right_expr>
<left_expr> → <dec_num> | <var>
<right_expr> → <dec_num> | <var>
<operator> → + | *
49
Semantic function for expressions
Me(<expr>, s) =
  case <expr> of
    <dec_num> => Mdec(<dec_num>)
    <var> =>
      if VARMAP(<var>, s) == undef
        then error
        else VARMAP(<var>, s)
    <binary_expr> =>
      if (Me(<binary_expr>.<left_expr>, s) == error
          or Me(<binary_expr>.<right_expr>, s) == error)
        then error
      else if (<binary_expr>.<operator> == ‘+’)
        then Me(<binary_expr>.<left_expr>, s) + Me(<binary_expr>.<right_expr>, s)
        else Me(<binary_expr>.<left_expr>, s) * Me(<binary_expr>.<right_expr>, s)
  end case
Ma(x = E, s) =
if Me(E, s) == error
then
error
else
s’ = {<i1,v1’>,<i2,v2’>,...,<in,vn’>},
where, for j = 1, 2, ..., n,
vj’ = VARMAP(ij, s) when ij <> x and
vj’ = Me( E, s) when ij == x
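The mapping functions Me and Ma can be sketched with a Python dictionary standing in for the state s. The encoding is an assumption made for the example: an `int` stands for a `<dec_num>` that has already been mapped by Mdec, a `str` stands for a `<var>`, and a 3-tuple `(left, operator, right)` stands for a `<binary_expr>`. The names `UNDEF`, `ERROR`, `varmap`, `m_e`, and `m_a` are hypothetical.

```python
UNDEF = "undef"   # special value of a variable that is currently undefined
ERROR = "error"   # result of a construct whose meaning is not defined


def varmap(name, state):
    """VARMAP(i, s): the current value of variable `name` in `state`."""
    return state.get(name, UNDEF)


def m_e(expr, state):
    """Me: map an expression and a state to an integer or to ERROR."""
    if isinstance(expr, int):                    # <dec_num>
        return expr
    if isinstance(expr, str):                    # <var>
        value = varmap(expr, state)
        return ERROR if value == UNDEF else value
    left, operator, right = expr                 # <binary_expr>
    left_value, right_value = m_e(left, state), m_e(right, state)
    # m_e reports problems as ERROR, so that is what the operands are tested for.
    if left_value == ERROR or right_value == ERROR:
        return ERROR
    return left_value + right_value if operator == "+" else left_value * right_value


def m_a(target, expr, state):
    """Ma: map the assignment `target = expr` and a state to a new state or ERROR."""
    value = m_e(expr, state)
    if value == ERROR:
        return ERROR
    # s' agrees with s on every variable except `target`.
    return {**state, target: value}
```

Starting from `{"x": 2, "y": UNDEF}`, the assignment `y = x + 1` maps to the state `{"x": 2, "y": 3}`, while `x = y` maps to `ERROR` because `y` has no value. The original state is never modified; each assignment yields a new state, which matches the state-to-state view of the semantics.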
51
Denotational semantics of logical
pretest loops
Logical pretest loops map states to states
Assume Msl maps a statement list to a state
Assume Mb maps a Boolean expression to a Boolean value or to error
Ml( while B do L end, s ) =
  if Mb(B, s) == error then
    error
  else if Mb(B, s) == false then
    s
  else if Msl(L, s) == error then
    error
  else
    Ml( while B do L end, Msl(L, s) )
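The recursive definition of Ml translates directly into a recursive Python function. In this sketch the Boolean expression B and the statement list L are passed as already-interpreted functions `m_b` and `m_sl` on states; that encoding, and the names `m_l` and `ERROR`, are assumptions made for the illustration.

```python
ERROR = "error"


def m_l(m_b, m_sl, state):
    """Meaning of `while B do L end` in `state`.

    m_b(state) plays the role of Mb(B, s) and returns True, False, or ERROR.
    m_sl(state) plays the role of Msl(L, s) and returns a new state or ERROR.
    """
    condition = m_b(state)
    if condition == ERROR:
        return ERROR
    if not condition:
        return state                       # loop exits: meaning is the current state
    next_state = m_sl(state)
    if next_state == ERROR:
        return ERROR
    return m_l(m_b, m_sl, next_state)      # iteration expressed as recursion
```

For a body that maps `{"i": i, "total": t}` to `{"i": i + 1, "total": t + i}` and a condition `i < 3`, the initial state `{"i": 0, "total": 0}` is mapped to `{"i": 3, "total": 3}`. Python limits recursion depth, so the sketch suits only loops with a modest number of iterations.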
52
Denotational semantics of loops
The meaning of the loop is the value of the
program variables after the statements in the loop
have been executed the prescribed number of
times (assuming there have been no errors)
In essence, the loop has been converted from
iteration to recursion, where the recursive control
is mathematically defined by other recursive state
mapping functions
Recursion, when compared to iteration, is easier to
describe with mathematical rigor
53
Denotational semantics
Evaluation of denotational semantics:
Can be used to determine meaning of complete
programs in a given language
Provides a rigorous way to think about programs
Can be an aid to language design
54
Axiomatic semantics
Based on formal logic (predicate calculus)
Original purpose: formal program verification
Each statement in a program is both preceded by
and followed by an assertion about program
variables
Assertions are also known as predicates
Assertions will be written with braces { } to
distinguish them from program statements
55
Axiomatic semantics
A precondition is an assertion immediately before a
statement that describes the relationships and constraints
among variables that are true at that point in execution
A postcondition is an assertion immediately following a
statement that describes the situation at that point
Our point of view is to compute the preconditions for a given
statement from the corresponding postconditions
It is also possible to set things up in the opposite direction
A weakest precondition is the least restrictive precondition
that will guarantee the validity of the associated
postcondition
56
Axiomatic semantics
Notation: {P} S {Q}
P is the precondition
S is a statement
Q is the postcondition
Example
Find the weakest precondition P for: {P} a = b + 1 {a > 1}
One possible precondition: {b > 10}
Weakest precondition: {b > 0}
57
Axiomatic semantics
If the weakest precondition can be computed for
each statement in a program, then a correctness
proof can be constructed for the program
Start by using the desired result as the
postcondition of the last statement and work
backward
The resulting precondition of the first statement
defines the conditions under which the program
will compute the desired result
If this precondition is the same as the program
specification, the program is correct
58
Axiomatic semantics
Weakest preconditions can be computed using an
axiom or using an inference rule
An axiom is a logical statement assumed to be
true
An inference rule is a method of inferring the truth
of one assertion on the basis of the values of other
assertions
Each statement type in the language must have an
axiom or an inference rule
We consider assignments, sequences, selection,
and loops
59
Assignment statements
Let x=E be a generic assignment statement
An axiom giving the precondition is sufficient in this case:
{Q[x→E]} x = E {Q}
Here the weakest precondition P is given by Q[x→E]
In other words, P is the same as Q with all instances of x replaced
by expression E
For example, consider a = a + b - 3 {a > 10}
Replace all instances of ‘a’ in {a > 10} by a+b-3
This gives a+b-3>10, or b>13-a
So, { Q[x→E] } is { b>13-a }
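The assignment axiom can be checked numerically. In the sketch below, assertions and expressions are modeled as Python functions on a state dictionary, which is an assumption made for the example, as is the name `wp_assign`. Evaluating the postcondition in the state updated by the assignment has the same effect as substituting E for x in Q.

```python
def wp_assign(var, expr, post):
    """Weakest precondition of `var = expr` with respect to postcondition `post`.

    `expr` and `post` are functions of a state dictionary. The result is a
    new predicate on states that holds exactly when `post` would hold after
    the assignment.
    """
    return lambda state: post({**state, var: expr(state)})


# The example above: a = a + b - 3 {a > 10}
pre = wp_assign("a", lambda s: s["a"] + s["b"] - 3, lambda s: s["a"] > 10)
```

Over any range of integer values for `a` and `b`, `pre` agrees with the hand-derived condition `b > 13 - a`. The same function applied to `a = b + 1 {a > 1}` gives a predicate that agrees with `b > 0`.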
60
Inference rules
The general form of an inference rule is
S1, S2, S3, …, Sn
-----------------
S
This states that if S1, S2, S3, …, and Sn are true, then
the truth of S can be inferred
61
The Rule of Consequence
{P} S {Q}, P’ => P, Q => Q’
---------------------------
{P’} S {Q’}
Here, => means “implies”
This says that a postcondition can always be weakened
and a precondition can always be strengthened
Thus in the earlier example
the postcondition { a>10 } can be weakened to { a>5 }
the precondition { b>13-a } can be strengthened to
{ b>15-a }
62
Sequence statements
Since a precondition for a sequence depends on the
statements in the sequence, the weakest precondition
cannot be described by an axiom
An inference rule is needed for sequences
Consider the sequence S1;S2 of two statements with
preconditions and postconditions as follows:
{P1} S1 {P2}
{P2} S2 {P3}
The inference rule is:
{P1} S1 {P2}, {P2} S2 {P3}
--------------------------
{P1} S1; S2 {P3}
63
Sequence statements example
Consider the following sequence and postcondition
y = 3*x + 1; x = y + 3 { x < 10 }
The weakest precondition for x = y + 3 is { y < 7 }
Since this is the postcondition for y = 3*x + 1, the
weakest precondition for the sequence is { x < 2 }
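Working backward through a sequence can be sketched by chaining the assignment axiom. As before, predicates and expressions are modeled as functions on a state dictionary; this encoding and the names `wp_assign` and `wp_sequence` are assumptions made for the illustration.

```python
def wp_assign(var, expr, post):
    """Weakest precondition of `var = expr` with respect to `post`."""
    return lambda state: post({**state, var: expr(state)})


def wp_sequence(statements, post):
    """Weakest precondition of a list of (var, expr) assignments.

    The postcondition of the last statement is pushed backward, so the
    precondition of each statement becomes the postcondition of the one before it.
    """
    for var, expr in reversed(statements):
        post = wp_assign(var, expr, post)
    return post


# The example above: y = 3*x + 1; x = y + 3 { x < 10 }
program = [("y", lambda s: 3 * s["x"] + 1), ("x", lambda s: s["y"] + 3)]
pre = wp_sequence(program, lambda s: s["x"] < 10)
```

For integer values of `x`, `pre({"x": x})` is true exactly when `x < 2`, which agrees with the result derived by hand.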
64
Selection statements
Consider only if-then-else statements
The inference rule is
{ B and P } S1 { Q }, { (not B) and P } S2 { Q}
-----------------------------------------------
{ P } if B then S1 else S2 { Q }
67
Example
Consider the loop: { P } while y <> x do y = y + 1 end { y = x }
An appropriate loop invariant is: I = { y <= x }
Let P = {y<=x} be the precondition for the while statement
Then
P => I is true
{ y < x } y = y + 1 { y <= x }
implies {y <= x and y <> x } y = y + 1 { y <= x },
which implies {I and B} S {I}
(I and (not B)) => Q is true because
{ y <= x and not (y <> x) } implies { y = x }, which is just Q
The loop terminates since P guarantees that initially y <= x
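The obligations in this example can be exercised at run time by placing the invariant at the points where it must hold. This is a test harness rather than a proof, it assumes integer values for x and y, and the function name `check_loop` is hypothetical.

```python
def check_loop(x, y):
    """Run `while y <> x do y = y + 1 end`, asserting I = (y <= x) at each step."""
    assert y <= x, "precondition P (and hence the invariant I) must hold on entry"
    while y != x:                      # B: y <> x
        assert y <= x and y != x       # I and B hold before the body
        y = y + 1                      # S
        assert y <= x                  # I holds again after the body
    assert y <= x and not (y != x)     # I and (not B) hold on exit
    return y
```

Every call with `y <= x` returns a value equal to `x`, which is the postcondition Q. A call with `y > x` fails the entry assertion, which reflects the fact that without P the loop would not terminate.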
68
Loops
The loop invariant I is a weakened version of the
loop postcondition, and it is also a precondition.
I must be weak enough to be satisfied prior to the
beginning of the loop
When combined with the loop exit condition, I must
be strong enough to force the truth of the
postcondition
69
Axiomatic semantics
Evaluation of axiomatic semantics
Developing axioms or inference rules for all of the
statements in a language can be difficult
Axiomatic semantics is . . .
a good tool for correctness proofs
an excellent framework for reasoning about programs
Axiomatic semantics is not as useful for language
users and compiler writers
70