* doc/cppinternals.texi: Update.
From-SVN: r45839
commit 4cf817a7eb
parent ef1d8fc882
2 changed files with 135 additions and 51 deletions
ChangeLog
@@ -1,3 +1,7 @@
+2001-09-27  Neil Booth  <neil@daikokuya.demon.co.uk>
+
+        * doc/cppinternals.texi: Update.
+
 2001-09-26  Neil Booth  <neil@daikokuya.demon.co.uk>
 
         * cpphash.h (struct cpp_pool): Remove locks and locked.
doc/cppinternals.texi
@@ -41,8 +41,8 @@ into another language, under the above conditions for modified versions.
 @titlepage
 @c @finalout
 @title Cpplib Internals
-@subtitle Last revised Jan 2001
-@subtitle for GCC version 3.0
+@subtitle Last revised September 2001
+@subtitle for GCC version 3.1
 @author Neil Booth
 @page
 @vskip 0pt plus 1filll
@@ -69,14 +69,14 @@ into another language, under the above conditions for modified versions.
 @node Top, Conventions,, (DIR)
 @chapter Cpplib---the core of the GNU C Preprocessor
 
-The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
+The GNU C preprocessor in GCC 3.x has been completely rewritten.  It is
 now implemented as a library, cpplib, so it can be easily shared between
 a stand-alone preprocessor, and a preprocessor integrated with the C,
 C++ and Objective-C front ends.  It is also available for use by other
 programs, though this is not recommended as its exposed interface has
 not yet reached a point of reasonable stability.
 
-This library has been written to be re-entrant, so that it can be used
+The library has been written to be re-entrant, so that it can be used
 to preprocess many files simultaneously if necessary.  It has also been
 written with the preprocessing token as the fundamental unit; the
 preprocessor in previous versions of GCC would operate on text strings
@@ -86,8 +86,6 @@ This brief manual documents some of the internals of cpplib, and a few
 tricky issues encountered.  It also describes certain behaviour we would
 like to preserve, such as the format and spacing of its output.
 
-Identifiers, macro expansion, hash nodes, lexing.
-
 @menu
 * Conventions::         Conventions used in the code.
 * Lexer::               The combined C, C++ and Objective-C Lexer.
@@ -123,18 +121,106 @@ behaviour.
 @node Lexer, Whitespace, Conventions, Top
 @unnumbered The Lexer
 @cindex lexer
 @cindex tokens
 
-The lexer is contained in the file @file{cpplex.c}.  We want to have a
-lexer that is single-pass, for efficiency reasons.  We would also like
-the lexer to only step forwards through the input files, and not step
-back.  This will make future changes to support different character
-sets, in particular state or shift-dependent ones, much easier.
+@section Overview
+The lexer is contained in the file @file{cpplex.c}.  It is a hand-coded
+lexer, and not implemented as a state machine.  It can understand C, C++
+and Objective-C source code, and has been extended to allow reasonably
+successful preprocessing of assembly language.  The lexer does not make
+an initial pass to strip out trigraphs and escaped newlines, but handles
+them as they are encountered in a single pass of the input file.  It
+returns preprocessing tokens individually, not a line at a time.
 
-This file also contains all information needed to spell a token, i.e.@: to
-output it either in a diagnostic or to a preprocessed output file.  This
-information is not exported, but made available to clients through such
-functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
+It is mostly transparent to users of the library, since the library's
+interface for obtaining the next token, @code{cpp_get_token}, takes care
+of lexing new tokens, handling directives, and expanding macros as
+necessary.  However, the lexer does expose some functionality so that
+clients of the library can easily spell a given token, such as
+@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
+useful when generating diagnostics, and for emitting the preprocessed
+output.
+
+@section Lexing a token
+Lexing of an individual token is handled by @code{_cpp_lex_direct} and
+its subroutines.  In its current form the code is quite complicated,
+with read ahead characters and suchlike, since it strives to not step
+back in the character stream in preparation for handling non-ASCII file
+encodings.  The current plan is to convert any such files to UTF-8
+before processing them.  This complexity is therefore unnecessary and
+will be removed, so I'll not discuss it further here.
+
+The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
+responsible for issues like directive handling, returning lookahead
+tokens directly, multiple-include optimisation, or conditional block
+skipping.  It necessarily has a minor r@^ole to play in memory
+management of lexed lines.  I discuss these issues in a separate section
+(@pxref{Lexing a line}).
+
+The lexer places the token it lexes into storage pointed to by the
+variable @var{cur_token}, and then increments it.  This variable is
+important for correct diagnostic positioning.  Unless a specific line
+and column are passed to the diagnostic routines, they will examine the
+@var{line} and @var{col} values of the token just before the location
+that @var{cur_token} points to, and use that location to report the
+diagnostic.
+
+The lexer does not consider whitespace to be a token in its own right.
+If whitespace (other than a new line) precedes a token, it sets the
+@code{PREV_WHITE} bit in the token's flags.  Each token has its
+@var{line} and @var{col} variables set to the line and column of the
+first character of the token.  This line number is the line number in
+the translation unit, and can be converted to a source (file, line) pair
+using the line map code.
+
+The first token on a logical, i.e.@: unescaped, line has the flag
+@code{BOL} set for beginning-of-line.  This flag is intended for
+internal use, both to distinguish a @samp{#} that begins a directive
+from one that doesn't, and to generate a callback to clients that want
+to be notified about the start of every non-directive line with tokens
+on it.  Clients cannot reliably determine this for themselves: the first
+token might be a macro, and the tokens of a macro expansion do not have
+the @code{BOL} flag set.  The macro expansion may even be empty, and the
+next token on the line certainly won't have the @code{BOL} flag set.
+
+New lines are treated specially; exactly how the lexer handles them is
+context-dependent.  The C standard mandates that directives are
+terminated by the first unescaped newline character, even if it appears
+in the middle of a macro expansion.  Therefore, if the state variable
+@var{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
+which is normally used to indicate end-of-file, to indicate
+end-of-directive.  In a directive a @code{CPP_EOF} token never means
+end-of-file.  Conveniently, if the caller was @code{collect_args}, it
+already handles @code{CPP_EOF} as if it were end-of-file, and reports an
+error about an unterminated macro argument list.
+
+The C standard also specifies that a new line in the middle of the
+arguments to a macro is treated as whitespace.  This white space is
+important in case the macro argument is stringified.  The state variable
+@code{parsing_args} is non-zero when the preprocessor is collecting the
+arguments to a macro call.  It is set to 1 when looking for the opening
+parenthesis to a function-like macro, and 2 when collecting the actual
+arguments up to the closing parenthesis, since these two cases need to
+be distinguished sometimes.  One such time is here: the lexer sets the
+@code{PREV_WHITE} flag of a token if it meets a new line when
+@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
+line when @code{parsing_args} is 1, since then code like
+
+@smallexample
+#define foo() bar
+foo
+baz
+@end smallexample
+
+@noindent would be output with an erroneous space before @samp{baz}:
+
+@smallexample
+foo
+ baz
+@end smallexample
+
+This is a good example of the subtlety of getting token spacing correct
+in the preprocessor; there are plenty of tests in the testsuite for
+corner cases like this.
 
 The most painful aspect of lexing ISO-standard C and C++ is handling
 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
@@ -148,62 +234,56 @@ within the characters of an identifier, and even between the @samp{*}
 and @samp{/} that terminates a comment.  Moreover, you cannot be sure
 there is just one---there might be an arbitrarily long sequence of them.
 
-So the routine @samp{parse_identifier}, that lexes an identifier, cannot
-assume that it can scan forwards until the first non-identifier
+So, for example, the routine that lexes a number, @code{parse_number},
+cannot assume that it can scan forwards until the first non-number
 character and be done with it, because this could be the @samp{\}
 introducing an escaped newline, or the @samp{?} introducing the trigraph
-sequence that represents the @samp{\} of an escaped newline.  Similarly
-for the routine that handles numbers, @samp{parse_number}.  If these
-routines stumble upon a @samp{?} or @samp{\}, they call
-@samp{skip_escaped_newlines} to skip over any potential escaped newlines
-before checking whether they can finish.
+sequence that represents the @samp{\} of an escaped newline.  If it
+encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
+to skip over any potential escaped newlines before checking whether the
+number has been finished.
 
-Similarly code in the main body of @samp{_cpp_lex_token} cannot simply
+Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
 check for a @samp{=} after a @samp{+} character to determine whether it
 has a @samp{+=} token; it needs to be prepared for an escaped newline of
-some sort.  These cases use the function @samp{get_effective_char},
-which returns the first character after any intervening newlines.
+some sort.  Such cases use the function @code{get_effective_char}, which
+returns the first character after any intervening escaped newlines.
 
-The lexer needs to keep track of the correct column position,
-including counting tabs as specified by the @option{-ftabstop=} option.
-This should be done even within comments; C-style comments can appear in
-the middle of a line, and we want to report diagnostics in the correct
+The lexer needs to keep track of the correct column position, including
+counting tabs as specified by the @option{-ftabstop=} option.  This
+should be done even within C-style comments; they can appear in the
+middle of a line, and we want to report diagnostics in the correct
 position for text appearing after the end of the comment.
 
-Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
+Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
 may be invalid and require a diagnostic.  However, if they appear in a
 macro expansion we don't want to complain with each use of the macro.
 It is therefore best to catch them during the lexing stage, in
-@samp{parse_identifier}.  In both cases, whether a diagnostic is needed
-or not is dependent upon lexer state.  For example, we don't want to
-issue a diagnostic for re-poisoning a poisoned identifier, or for using
-@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
-Therefore @samp{parse_identifier} makes use of flags to determine
+@code{parse_identifier}.  In both cases, whether a diagnostic is needed
+or not is dependent upon the lexer's state.  For example, we don't want
+to issue a diagnostic for re-poisoning a poisoned identifier, or for
+using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
+Therefore @code{parse_identifier} makes use of state flags to determine
 whether a diagnostic is appropriate.  Since we change state on a
 per-token basis, and don't lex whole lines at a time, this is not a
 problem.
 
 Another place where state flags are used to change behaviour is whilst
-parsing header names.  Normally, a @samp{<} would be lexed as a single
-token.  After a @code{#include} directive, though, it should be lexed
-as a single token as far as the nearest @samp{>} character.  Note that
-we don't allow the terminators of header names to be escaped; the first
+lexing header names.  Normally, a @samp{<} would be lexed as a single
+token.  After a @code{#include} directive, though, it should be lexed as
+a single token as far as the nearest @samp{>} character.  Note that we
+don't allow the terminators of header names to be escaped; the first
 @samp{"} or @samp{>} terminates the header name.
 
 Interpretation of some character sequences depends upon whether we are
 lexing C, C++ or Objective-C, and on the revision of the standard in
-force.  For example, @samp{::} is a single token in C++, but two
-separate @samp{:} tokens, and almost certainly a syntax error, in C@.
-Such cases are handled in the main function @samp{_cpp_lex_token}, based
-upon the flags set in the @samp{cpp_options} structure.
+force.  For example, @samp{::} is a single token in C++, but in C it is
+two separate @samp{:} tokens and almost certainly a syntax error.  Such
+cases are handled by @code{_cpp_lex_direct} based upon command-line
+flags stored in the @code{cpp_options} structure.
 
-Note we have almost, but not quite, achieved the goal of not stepping
-backwards in the input stream.  Currently @samp{skip_escaped_newlines}
-does step back, though with care it should be possible to adjust it so
-that this does not happen.  For example, one tricky issue is if we meet
-a trigraph, but the command line option @option{-trigraphs} is not in
-force but @option{-Wtrigraphs} is, we need to warn about it but then
-buffer it and continue to treat it as 3 separate characters.
+@anchor{Lexing a line}
+@section Lexing a line
 
 @node Whitespace, Hash Nodes, Lexer, Top
 @unnumbered Whitespace
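
The column tracking required by @option{-ftabstop=} amounts to rounding the column up to the next tab stop when a tab is seen.  A minimal sketch of that rule (the function name is invented for illustration and is not cpplib's interface):

```c
#include <assert.h>

/* Illustrative only: advance a 0-based column count over one
   character C.  A tab moves the column to the next multiple of
   TABSTOP; every other character advances it by one.  */
static unsigned int
next_column (unsigned int col, char c, unsigned int tabstop)
{
  if (c == '\t')
    return col + tabstop - col % tabstop;
  return col + 1;
}
```

With the default tab stop of 8, a tab at column 3 moves to column 8, and a tab at column 8 moves to column 16, matching how diagnostics after a comment containing tabs stay correctly positioned.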
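
The header-name special case (after @code{#include}, a @samp{<} begins a single token running to the nearest @samp{>}, and the terminator cannot be escaped) can be sketched as follows.  This is an invented illustration, not cpplib's routine: because no escaping is allowed, a plain search for the terminator suffices.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: if P points at '<', lex everything up to the nearest
   '>' as one header-name token and return a malloc'd copy of the name
   between the brackets, or NULL if the name is unterminated or P does
   not start a header name.  */
static char *
lex_header_name_sketch (const char *p)
{
  if (*p != '<')
    return NULL;

  const char *end = strchr (p + 1, '>');
  if (end == NULL)
    return NULL;  /* Unterminated header name.  */

  size_t len = (size_t) (end - (p + 1));
  char *name = malloc (len + 1);
  memcpy (name, p + 1, len);
  name[len] = '\0';
  return name;
}
```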