* doc/cppinternals.texi: Update.

From-SVN: r45839
This commit is contained in:
Neil Booth 2001-09-27 11:10:40 +00:00 committed by Neil Booth
parent ef1d8fc882
commit 4cf817a7eb
2 changed files with 135 additions and 51 deletions


@@ -1,3 +1,7 @@
2001-09-27 Neil Booth <neil@daikokuya.demon.co.uk>
* doc/cppinternals.texi: Update.
2001-09-26 Neil Booth <neil@daikokuya.demon.co.uk>
* cpphash.h (struct cpp_pool): Remove locks and locked.


@@ -41,8 +41,8 @@ into another language, under the above conditions for modified versions.
@titlepage
@c @finalout
@title Cpplib Internals
@subtitle Last revised September 2001
@subtitle for GCC version 3.1
@author Neil Booth
@page
@vskip 0pt plus 1filll
@@ -69,14 +69,14 @@ into another language, under the above conditions for modified versions.
@node Top, Conventions,, (DIR)
@chapter Cpplib---the core of the GNU C Preprocessor
The GNU C preprocessor in GCC 3.x has been completely rewritten. It is
now implemented as a library, cpplib, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends. It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.
The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
@@ -86,8 +86,6 @@ This brief manual documents some of the internals of cpplib, and a few
tricky issues encountered. It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.
@menu
* Conventions:: Conventions used in the code.
* Lexer:: The combined C, C++ and Objective-C Lexer.
@@ -123,18 +121,106 @@ behaviour.
@node Lexer, Whitespace, Conventions, Top
@unnumbered The Lexer
@cindex lexer
@cindex tokens
@section Overview
The lexer is contained in the file @file{cpplex.c}. It is a hand-coded
lexer, and not implemented as a state machine. It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.
It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}. These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.
@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines. In its current form the code is quite complicated,
with read ahead characters and suchlike, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.
The job of @code{_cpp_lex_direct} is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimisation, or conditional block
skipping. It necessarily has a minor r@^ole to play in memory
management of lexed lines. I discuss these issues in a separate section
(@pxref{Lexing a line}).
The lexer places the token it lexes into storage pointed to by the
variable @var{cur_token}, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@var{line} and @var{col} values of the token just before the location
that @var{cur_token} points to, and use that location to report the
diagnostic.
The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags. Each token has its
@var{line} and @var{col} variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line. This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a callback to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set. The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.
New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
@var{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a @code{CPP_EOF} token never means
end-of-file. Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.
The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This whitespace is
important in case the macro argument is stringified. The state variable
@code{parsing_args} is non-zero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2. It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like
@smallexample
#define foo() bar
foo
baz
@end smallexample
@noindent
would be output with an erroneous space before @samp{baz}:
@smallexample
foo
baz
@end smallexample
This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.
The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
@@ -148,62 +234,56 @@ within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminates a comment. Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.
So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline. If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.
Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.
The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state. For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.
Another place where state flags are used to change behaviour is whilst
lexing header names. Normally, a @samp{<} would be lexed as a single
token. After a @code{#include} directive, though, it should be lexed as
a single token as far as the nearest @samp{>} character. Note that we
don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, @samp{::} is a single token in C++, but two
force. For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error. Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
Note we have almost, but not quite, achieved the goal of not stepping
backwards in the input stream. Currently @samp{skip_escaped_newlines}
does step back, though with care it should be possible to adjust it so
that this does not happen. For example, one tricky issue is if we meet
a trigraph, but the command line option @option{-trigraphs} is not in
force but @option{-Wtrigraphs} is, we need to warn about it but then
buffer it and continue to treat it as 3 separate characters.
@anchor{Lexing a line}
@section Lexing a line
@node Whitespace, Hash Nodes, Lexer, Top
@unnumbered Whitespace