tor-browser

The Tor Browser
git clone https://git.dasho.dev/tor-browser.git

TokenStream.h (109642B)


      1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
      2 * vim: set ts=8 sts=2 et sw=2 tw=80:
      3 * This Source Code Form is subject to the terms of the Mozilla Public
      4 * License, v. 2.0. If a copy of the MPL was not distributed with this
      5 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
      6 
      7 /*
      8 * Streaming access to the raw tokens of JavaScript source.
      9 *
     10 * Because JS tokenization is context-sensitive -- a '/' could be either a
     11 * regular expression *or* a division operator depending on context -- the
     12 * various token stream classes are mostly not useful outside of the Parser
     13 * where they reside.  We should probably eventually merge the two concepts.
     14 */
     15 #ifndef frontend_TokenStream_h
     16 #define frontend_TokenStream_h
     17 
     18 /*
     19 * [SMDOC] Parser Token Stream
     20 *
     21 * A token stream exposes the raw tokens -- operators, names, numbers,
     22 * keywords, and so on -- of JavaScript source code.
     23 *
     24 * These are the components of the overall token stream concept:
     25 * TokenStreamShared, TokenStreamAnyChars, TokenStreamCharsBase<Unit>,
     26 * TokenStreamChars<Unit>, and TokenStreamSpecific<Unit, AnyCharsAccess>.
     27 *
     28 * == TokenStreamShared → ∅ ==
     29 *
     30 * Certain aspects of tokenizing are used everywhere:
     31 *
     32 *   * modifiers (used to select which context-sensitive interpretation of a
     33 *     character should be used to decide what token it is) and modifier
     34 *     assertion handling;
     35 *   * flags on the overall stream (have we encountered any characters on this
     36 *     line?  have we hit a syntax error?  and so on);
     37 *   * and certain token-count constants.
     38 *
     39 * These are all defined in TokenStreamShared.  (They could be namespace-
     40 * scoped, but it seems tentatively better not to clutter the namespace.)
     41 *
     42 * == TokenStreamAnyChars → TokenStreamShared ==
     43 *
     44 * Certain aspects of tokenizing have meaning independent of the character type
     45 * of the source text being tokenized: line/column number information, tokens
     46 * in lookahead from determining the meaning of a prior token, compilation
     47 * options, the filename, flags, source map URL, access to details of the
     48 * current and next tokens (is the token of the given type?  what name or
     49 * number is contained in the token?  and other queries), and others.
     50 *
     51 * All this data/functionality *could* be duplicated for both single-byte and
     52 * double-byte tokenizing, but there are two problems.  First, it's potentially
     53 * wasteful if the compiler doesn't recognize it can unify the concepts.  (And
     54 * if any-character concepts are intermixed with character-specific concepts,
     55 * potentially the compiler *can't* unify them because offsets into the
     56 * hypothetical TokenStream<Unit>s would differ.)  Second, some of this stuff
     57 * needs to be accessible in ParserBase, the aspects of JS language parsing
     58 * that have meaning independent of the character type of the source text being
     59 * parsed.  So we need a separate data structure that ParserBase can hold on to
     60 * for it.  (ParserBase isn't the only instance of this, but it's certainly the
     61 * biggest case of it.)  Ergo, TokenStreamAnyChars.
     62 *
     63 * == TokenStreamCharsShared → ∅ ==
     64 *
     65 * Some functionality has meaning independent of character type, yet has no use
     66 * *unless* you know the character type in actual use.  It *could* live in
     67 * TokenStreamAnyChars, but it makes more sense to live in a separate class
     68 * that character-aware token information can simply inherit.
     69 *
     70 * This class currently exists only to contain a char16_t buffer, transiently
     71 * used to accumulate strings in tricky cases that can't just be read directly
     72 * from source text.  It's not used outside character-aware tokenizing, so it
     73 * doesn't make sense in TokenStreamAnyChars.
     74 *
     75 * == TokenStreamCharsBase<Unit> → TokenStreamCharsShared ==
     76 *
     77 * Certain data structures in tokenizing are character-type-specific: namely,
     78 * the various pointers identifying the source text (including current offset
     79 * and end).
     80 *
     81 * Additionally, some functions operating on this data are defined the same way
     82 * no matter what character type you have (e.g. current offset in code units
     83 * into the source text) or share a common interface regardless of character
     84 * type (e.g. consume the next code unit if it has a given value).
     85 *
     86 * All such functionality lives in TokenStreamCharsBase<Unit>.
     87 *
     88 * == SpecializedTokenStreamCharsBase<Unit> → TokenStreamCharsBase<Unit> ==
     89 *
     90 * Certain tokenizing functionality is specific to a single character type.
     91 * For example, JS's UTF-16 encoding recognizes no coding errors, because lone
     92 * surrogates are not an error; but a UTF-8 encoding must recognize a variety
     93 * of validation errors.  Such functionality is defined only in the appropriate
     94 * SpecializedTokenStreamCharsBase specialization.
     95 *
     96 * == GeneralTokenStreamChars<Unit, AnyCharsAccess> →
     97 *    SpecializedTokenStreamCharsBase<Unit> ==
     98 *
     99 * Some functionality operates differently on different character types, just
    100 * as for TokenStreamCharsBase, but additionally requires access to character-
    101 * type-agnostic information in TokenStreamAnyChars.  For example, getting the
    102 * next character performs different steps for different character types and
    103 * must access TokenStreamAnyChars to update line break information.
    104 *
    105 * Such functionality, if it can be defined using the same algorithm for all
    106 * character types, lives in GeneralTokenStreamChars<Unit, AnyCharsAccess>.
    107 * The AnyCharsAccess parameter provides a way for a GeneralTokenStreamChars
    108 * instance to access its corresponding TokenStreamAnyChars, without inheriting
    109 * from it.
    110 *
    111 * GeneralTokenStreamChars<Unit, AnyCharsAccess> is just functionality, no
    112 * actual member data.
    113 *
    114 * Such functionality all lives in TokenStreamChars<Unit, AnyCharsAccess>, a
    115 * declared-but-not-defined template class whose specializations have a common
    116 * public interface (plus whatever private helper functions are desirable).
    117 *
    118 * == TokenStreamChars<Unit, AnyCharsAccess> →
    119 *    GeneralTokenStreamChars<Unit, AnyCharsAccess> ==
    120 *
    121 * Some functionality is like that in GeneralTokenStreamChars, *but* it's
    122 * defined entirely differently for different character types.
    123 *
    124 * For example, consider "match a multi-code unit code point" (hypothetically:
    125 * we've only implemented two-byte tokenizing right now):
    126 *
    127 *   * For two-byte text, there must be two code units to get, the leading code
    128 *     unit must be a UTF-16 lead surrogate, and the trailing code unit must be
    129 *     a UTF-16 trailing surrogate.  (If any of these fail to hold, the next
    130 *     code unit encodes a code point by itself and is not multi-code unit.)
    131 *   * For single-byte Latin-1 text, there are no multi-code unit code points.
    132 *   * For single-byte UTF-8 text, the first code unit must have N > 1 of its
    133 *     highest bits set (and the next unset), and |N - 1| successive code units
    134 *     must have their high bit set and next-highest bit unset, *and*
    135 *     concatenating all unconstrained bits together must not produce a code
    136 *     point value that could have been encoded in fewer code units.
    137 *
    138 * This functionality can't be implemented as member functions in
    139 * GeneralTokenStreamChars because we'd need to *partially specialize* those
    140 * functions -- hold Unit constant while letting AnyCharsAccess vary.  But
    141 * C++ forbids function template partial specialization like this: either you
    142 * fix *all* parameters or you fix none of them.
    143 *
    144 * Fortunately, C++ *does* allow *class* template partial specialization.  So
    145 * TokenStreamChars is a template class with one specialization per Unit.
    146 * Functions can be defined differently in the different specializations,
    147 * because AnyCharsAccess as the only template parameter on member functions
    148 * *can* vary.
    149 *
    150 * All TokenStreamChars<Unit, AnyCharsAccess> specializations, one per Unit,
    151 * are just functionality, no actual member data.
    152 *
    153 * == TokenStreamSpecific<Unit, AnyCharsAccess> →
    154 *    TokenStreamChars<Unit, AnyCharsAccess>, TokenStreamShared,
    155 *    ErrorReporter ==
    156 *
    157 * TokenStreamSpecific is operations that are parametrized on character type
    158 * but implement the *general* idea of tokenizing, without being intrinsically
    159 * tied to character type.  Notably, this includes all operations that can
    160 * report warnings or errors at particular offsets, because we include a line
    161 * of context with such errors -- and that necessarily accesses the raw
    162 * characters of their specific type.
    163 *
    164 * Much TokenStreamSpecific operation depends on functionality in
    165 * TokenStreamAnyChars.  The obvious solution is to inherit it -- but this
    166 * doesn't work in Parser: its ParserBase base class needs some
    167 * TokenStreamAnyChars functionality without knowing character type.
    168 *
    169 * The AnyCharsAccess type parameter is a class that statically converts from a
    170 * TokenStreamSpecific* to its corresponding TokenStreamAnyChars.  The
    171 * TokenStreamSpecific in Parser<ParseHandler, Unit> can then specify a class
    172 * that properly converts from TokenStreamSpecific Parser::tokenStream to
    173 * TokenStreamAnyChars ParserBase::anyChars.
    174 *
    175 * Could we hardcode one set of offset calculations for this and eliminate
    176 * AnyCharsAccess?  No.  Offset calculations possibly could be hardcoded if
    177 * TokenStreamSpecific were present in Parser before Parser::handler, assuring
    178 * the same offsets in all Parser-related cases.  But there's still a separate
    179 * TokenStream class, that requires different offset calculations.  So even if
    180 * we wanted to hardcode this (it's not clear we would, because forcing the
    181 * TokenStreamSpecific declarer to specify this is more explicit), we couldn't.
    182 */
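// A minimal sketch of the class template partial specialization technique
// described in the overview above, assuming two made-up access policies.  The
// names CharsSketch, AccessA and AccessB are hypothetical and do not appear in
// this header; |char| merely stands in for a UTF-8 code unit type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for two different AnyCharsAccess policies.
struct AccessA {};
struct AccessB {};

// Declared but never defined: only the per-Unit partial specializations exist.
template <typename Unit, class AnyCharsAccess>
class CharsSketch;

// Partial specialization: Unit is fixed to char16_t, AnyCharsAccess may vary.
template <class AnyCharsAccess>
class CharsSketch<char16_t, AnyCharsAccess> {
 public:
  // UTF-16: a code point occupies at most two code units (a surrogate pair).
  static constexpr std::size_t maxUnitsPerCodePoint() { return 2; }
};

// Partial specialization: Unit is fixed to char (standing in for UTF-8).
template <class AnyCharsAccess>
class CharsSketch<char, AnyCharsAccess> {
 public:
  // UTF-8: a code point occupies at most four code units.
  static constexpr std::size_t maxUnitsPerCodePoint() { return 4; }
};
```

// The member function differs per Unit while remaining usable with any access
// policy, which a partially specialized *function* template could not express.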
    183 
    184 #include "mozilla/ArrayUtils.h"
    185 #include "mozilla/Assertions.h"
    186 #include "mozilla/Attributes.h"
    187 #include "mozilla/Casting.h"
    188 #include "mozilla/Maybe.h"
    189 #include "mozilla/MemoryChecking.h"
    190 #include "mozilla/Span.h"
    191 #include "mozilla/TextUtils.h"
    192 #include "mozilla/Utf8.h"
    193 
    194 #include <algorithm>
    195 #include <stdarg.h>
    196 #include <stddef.h>
    197 #include <stdint.h>
    198 #include <stdio.h>
    199 #include <type_traits>
    200 
    201 #include "jspubtd.h"
    202 
    203 #include "frontend/ErrorReporter.h"
    204 #include "frontend/ParserAtom.h"  // ParserAtom, ParserAtomsTable, TaggedParserAtomIndex
    205 #include "frontend/Token.h"
    206 #include "frontend/TokenKind.h"
    207 #include "js/CharacterEncoding.h"  // JS::ConstUTF8CharsZ
    208 #include "js/ColumnNumber.h"  // JS::LimitedColumnNumberOneOrigin, JS::ColumnNumberOneOrigin, JS::ColumnNumberUnsignedOffset
    209 #include "js/CompileOptions.h"
    210 #include "js/friend/ErrorMessages.h"  // JSMSG_*
    211 #include "js/HashTable.h"             // js::HashMap
    212 #include "js/RegExpFlags.h"           // JS::RegExpFlags
    213 #include "js/UniquePtr.h"
    214 #include "js/Vector.h"
    215 #include "util/Unicode.h"
    216 #include "vm/ErrorReporting.h"
    217 
    218 struct KeywordInfo;
    219 
    220 namespace js {
    221 
    222 class FrontendContext;
    223 
    224 namespace frontend {
    225 
    226 // True if `atom` is a keyword.
    227 bool IsKeyword(TaggedParserAtomIndex atom);
    228 
    229 // If `name` is a reserved word, returns its TokenKind.
    230 // TokenKind::Limit otherwise.
    231 extern TokenKind ReservedWordTokenKind(TaggedParserAtomIndex name);
    232 
    233 // If `name` is a reserved word, returns its string representation.
    234 // nullptr otherwise.
    235 extern const char* ReservedWordToCharZ(TaggedParserAtomIndex name);
    236 
    237 // If `tt` is a reserved word, returns its string representation.
    238 // nullptr otherwise.
    239 extern const char* ReservedWordToCharZ(TokenKind tt);
    240 
    241 enum class DeprecatedContent : uint8_t {
    242  // No deprecated content was present.
    243  None = 0,
    244  // Octal literal not prefixed by "0o" but rather by just "0", e.g. 0755.
    245  OctalLiteral,
    246  // Octal character escape, e.g. "hell\157 world".
    247  OctalEscape,
    248  // NonOctalDecimalEscape, i.e. "\8" or "\9".
    249  EightOrNineEscape,
    250 };
    251 
    252 struct TokenStreamFlags {
    253  // Hit end of file.
    254  bool isEOF : 1;
    255  // Non-whitespace since start of line.
    256  bool isDirtyLine : 1;
    257  // Hit a syntax error, at start or during a token.
    258  bool hadError : 1;
    259 
    260  // The nature of any deprecated content seen since last reset.
    261  // We have to use uint8_t instead of DeprecatedContent to work around a GCC 7 bug.
    262  // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61414
    263  uint8_t sawDeprecatedContent : 2;
    264 
    265  TokenStreamFlags()
    266      : isEOF(false),
    267        isDirtyLine(false),
    268        hadError(false),
    269        sawDeprecatedContent(uint8_t(DeprecatedContent::None)) {}
    270 };
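// A minimal sketch of the workaround noted in TokenStreamFlags: an enum value
// is stored in a narrow uint8_t bit-field and cast back on read.  The names
// DeprecatedKind and FlagsSketch, and the set()/get() helpers, are
// hypothetical and are not part of this header.

```cpp
#include <cassert>
#include <cstdint>

// Four enumerators, so two bits are enough to store any of them.
enum class DeprecatedKind : uint8_t {
  None = 0,
  OctalLiteral,
  OctalEscape,
  EightOrNineEscape,
};

struct FlagsSketch {
  bool isEOF : 1;
  // Declared as uint8_t rather than DeprecatedKind; values 0..3 fit in 2 bits.
  uint8_t sawDeprecated : 2;

  FlagsSketch() : isEOF(false), sawDeprecated(uint8_t(DeprecatedKind::None)) {}

  // Convert explicitly at the boundary so callers still see the enum type.
  void set(DeprecatedKind kind) { sawDeprecated = uint8_t(kind); }
  DeprecatedKind get() const { return DeprecatedKind(sawDeprecated); }
};
```

// Every enumerator round-trips through the 2-bit field unchanged.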
    271 
    272 template <typename Unit>
    273 class TokenStreamPosition;
    274 
    275 /**
    276 * TokenStream types and constants that are used in both TokenStreamAnyChars
    277 * and TokenStreamSpecific.  Do not add any non-static data members to this
    278 * class!
    279 */
    280 class TokenStreamShared {
    281 protected:
    282  // 1 current + (3 lookahead if ENABLE_EXPLICIT_RESOURCE_MANAGEMENT is
    283  // defined, else 2 lookahead), rounded up to a power of 2.
    284  // NOTE: This must be a power of 2, in order to make `ntokensMask` work.
    285  static constexpr size_t ntokens = 4;
    286 
    287  static constexpr unsigned ntokensMask = ntokens - 1;
    288 
    289  template <typename Unit>
    290  friend class TokenStreamPosition;
    291 
    292 public:
    293 #ifdef ENABLE_EXPLICIT_RESOURCE_MANAGEMENT
    294  // We need a lookahead buffer of at least 3 for the AwaitUsing syntax.
    295  static constexpr unsigned maxLookahead = 3;
    296 #else
    297  static constexpr unsigned maxLookahead = 2;
    298 #endif
    299 
    300  using Modifier = Token::Modifier;
    301  static constexpr Modifier SlashIsDiv = Token::SlashIsDiv;
    302  static constexpr Modifier SlashIsRegExp = Token::SlashIsRegExp;
    303  static constexpr Modifier SlashIsInvalid = Token::SlashIsInvalid;
    304 
    305  static void verifyConsistentModifier(Modifier modifier,
    306                                       const Token& nextToken) {
    307    MOZ_ASSERT(
    308        modifier == nextToken.modifier || modifier == SlashIsInvalid,
    309        "This token was scanned with both SlashIsRegExp and SlashIsDiv, "
    310        "indicating the parser is confused about how to handle a slash here. "
    311        "See comment at Token::Modifier.");
    312  }
    313 };
    314 
    315 static_assert(std::is_empty_v<TokenStreamShared>,
    316              "TokenStreamShared shouldn't bloat classes that inherit from it");
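// A minimal sketch of why |ntokens| must be a power of 2: masking with
// |ntokens - 1| then wraps an index around a circular buffer without a modulo
// or a branch.  The names kTokens, kTokensMask, nextCursor and prevCursor are
// hypothetical; how this header's users actually advance the cursor is an
// assumption here.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTokens = 4;             // must be a power of 2
constexpr unsigned kTokensMask = kTokens - 1;  // binary 011

static_assert((kTokens & (kTokens - 1)) == 0,
              "masking only equals modulo when the size is a power of 2");

// Advance one slot, wrapping from the last slot back to slot 0.
inline unsigned nextCursor(unsigned cursor) {
  return (cursor + 1) & kTokensMask;
}

// Step back one slot; unsigned wraparound plus the mask maps 0 to the last
// slot.
inline unsigned prevCursor(unsigned cursor) {
  return (cursor - 1) & kTokensMask;
}
```

// With a size of 3, for example, the mask would be binary 010 and most indices
// would map to the wrong slot, hence the power-of-2 requirement.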
    317 
    318 template <typename Unit, class AnyCharsAccess>
    319 class TokenStreamSpecific;
    320 
    321 template <typename Unit>
    322 class MOZ_STACK_CLASS TokenStreamPosition final {
    323 public:
    324  template <class AnyCharsAccess>
    325  inline explicit TokenStreamPosition(
    326      TokenStreamSpecific<Unit, AnyCharsAccess>& tokenStream);
    327 
    328 private:
    329  TokenStreamPosition(const TokenStreamPosition&) = delete;
    330 
    331  // Technically only TokenStreamSpecific<Unit, AnyCharsAccess>::seek with
    332  // Unit constant and AnyCharsAccess varying must be friended, but 1) it's
    333  // hard to friend one function in template classes, and 2) C++ doesn't
    334  // allow partial friend specialization to target just that single class.
    335  template <typename Char, class AnyCharsAccess>
    336  friend class TokenStreamSpecific;
    337 
    338  const Unit* buf;
    339  TokenStreamFlags flags;
    340  unsigned lineno;
    341  size_t linebase;
    342  size_t prevLinebase;
    343  Token currentToken;
    344  unsigned lookahead;
    345  Token lookaheadTokens[TokenStreamShared::maxLookahead];
    346 };
    347 
    348 template <typename Unit>
    349 class SourceUnits;
    350 
    351 /**
    352 * This class maps:
    353 *
    354 *   * a sourceUnits offset (a 0-indexed count of code units)
    355 *
    356 * to
    357 *
    358 *   * a (1-indexed) line number and
    359 *   * a (0-indexed) offset in code *units* (not code points, not bytes) into
    360 *     that line,
    361 *
    362 * for either |Unit = Utf8Unit| or |Unit = char16_t|.
    363 *
    364 * Note that, if |Unit = Utf8Unit|, the latter quantity is *not* the same as a
    365 * column number, which is a count of UTF-16 code units.  Computing a column
    366 * number requires the offset within the line and the source units of that line
    367 * (including what type |Unit| is, to know how to decode them).  If you need a
    368 * column number, functions in |GeneralTokenStreamChars<Unit>| will consult
    369 * this and source units to compute it.
    370 */
    371 class SourceCoords {
    372  // For a given buffer holding source code, |lineStartOffsets_| has one
    373  // element per line of source code, plus one sentinel element.  Each
    374  // non-sentinel element holds the buffer offset for the start of the
    375  // corresponding line of source code.  For this example script,
    376  // assuming an initialLineOffset of 0:
    377  //
    378  // 1  // xyz            [line starts at offset 0]
    379  // 2  var x;            [line starts at offset 7]
    380  // 3                    [line starts at offset 14]
    381  // 4  var y;            [line starts at offset 15]
    382  //
    383  // |lineStartOffsets_| is:
    384  //
    385  //   [0, 7, 14, 15, MAX_PTR]
    386  //
    387  // To convert a "line number" to an "index" into |lineStartOffsets_|,
    388  // subtract |initialLineNum_|.  E.g. line 3's index is
    389  // (3 - initialLineNum_), which is 2.  Therefore lineStartOffsets_[2]
    390  // holds the buffer offset for the start of line 3, which is 14.  (Note
    391  // that |initialLineNum_| is often 1, but not always.)
    392  //
    393  // The first element is always initialLineOffset, passed to the
    394  // constructor, and the last element is always the MAX_PTR sentinel.
    395  //
    396  // Offset-to-{line,offset-into-line} lookups are O(log n) in the worst
    397  // case (binary search), but in practice they're heavily clustered and
    398  // we do better than that by using the previous lookup's result
    399  // (lastIndex_) as a starting point.
    400  //
    401  // Checking if an offset lies within a particular line number
    402  // (isOnThisLine()) is O(1).
    403  //
    404  Vector<uint32_t, 128> lineStartOffsets_;
    405 
    406  /** The line number on which the source text begins. */
    407  uint32_t initialLineNum_;
    408 
    409  /**
    410   * The index corresponding to the last offset lookup -- used so that if
    411   * offset lookups proceed in increasing order, and the offset appears
    412   * in the next couple lines from the last offset, we can avoid a full
    413   * binary-search.
    414   *
    415   * This is mutable because it's modified on every search, but that fact
    416   * isn't visible outside this class.
    417   */
    418  mutable uint32_t lastIndex_;
    419 
    420  uint32_t indexFromOffset(uint32_t offset) const;
    421 
    422  static const uint32_t MAX_PTR = UINT32_MAX;
    423 
    424  uint32_t lineNumberFromIndex(uint32_t index) const {
    425    return index + initialLineNum_;
    426  }
    427 
    428  uint32_t indexFromLineNumber(uint32_t lineNum) const {
    429    return lineNum - initialLineNum_;
    430  }
    431 
    432 public:
    433  SourceCoords(FrontendContext* fc, uint32_t initialLineNumber,
    434               uint32_t initialOffset);
    435 
    436  [[nodiscard]] bool add(uint32_t lineNum, uint32_t lineStartOffset);
    437  [[nodiscard]] bool fill(const SourceCoords& other);
    438 
    439  std::optional<bool> isOnThisLine(uint32_t offset, uint32_t lineNum) const {
    440    uint32_t index = indexFromLineNumber(lineNum);
    441    if (index + 1 >= lineStartOffsets_.length()) {  // +1 due to sentinel
    442      return std::nullopt;
    443    }
    444    return (lineStartOffsets_[index] <= offset &&
    445            offset < lineStartOffsets_[index + 1]);
    446  }
    447 
    448  /**
    449   * A token, computed for an offset in source text, that can be used to
    450   * access line number and line-offset information for that offset.
    451   *
    452   * LineToken *alone* exposes whether the corresponding offset is in the
    453   * first line of source (which may not be line 1, depending on
    454   * |initialLineNumber|), and whether it's in the same line as
    455   * another LineToken.
    456   */
    457  class LineToken {
    458    uint32_t index;
    459 #ifdef DEBUG
    460    uint32_t offset_;  // stored for consistency-of-use assertions
    461 #endif
    462 
    463    friend class SourceCoords;
    464 
    465   public:
    466    LineToken(uint32_t index, uint32_t offset)
    467        : index(index)
    468 #ifdef DEBUG
    469          ,
    470          offset_(offset)
    471 #endif
    472    {
    473    }
    474 
    475    bool isFirstLine() const { return index == 0; }
    476 
    477    bool isSameLine(LineToken other) const { return index == other.index; }
    478 
    479    void assertConsistentOffset(uint32_t offset) const {
    480      MOZ_ASSERT(offset_ == offset);
    481    }
    482  };
    483 
    484  /**
    485   * Compute a token usable to access information about the line at the
    486   * given offset.
    487   *
    488   * The only information directly accessible in a token is whether it
    489   * corresponds to the first line of source text (which may not be line
    490   * 1, depending on the |initialLineNumber| value used to construct
    491   * this).  Use |lineNumber(LineToken)| to compute the actual line
    492   * number (incorporating the contribution of |initialLineNumber|).
    493   */
    494  LineToken lineToken(uint32_t offset) const;
    495 
    496  /** Compute the line number for the given token. */
    497  uint32_t lineNumber(LineToken lineToken) const {
    498    return lineNumberFromIndex(lineToken.index);
    499  }
    500 
    501  /** Return the offset of the start of the line for |lineToken|. */
    502  uint32_t lineStart(LineToken lineToken) const {
    503    MOZ_ASSERT(lineToken.index + 1 < lineStartOffsets_.length(),
    504               "recorded line-start information must be available");
    505    return lineStartOffsets_[lineToken.index];
    506  }
    507 };
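// A minimal sketch of the offset-to-line lookup described in SourceCoords,
// using the example table [0, 7, 14, 15, MAX_PTR] from the comment there.
// This assumes a plain binary search via std::upper_bound and omits the
// |lastIndex_| fast path; LineTableSketch and its members are hypothetical
// names, not this header's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct LineTableSketch {
  // One start offset per line, followed by a sentinel.
  std::vector<uint32_t> lineStartOffsets{0, 7, 14, 15, UINT32_MAX};
  uint32_t initialLineNum = 1;

  // Index of the line containing |offset| (which must be >= the first entry).
  uint32_t indexFromOffset(uint32_t offset) const {
    // Find the first line start strictly greater than |offset|, ignoring the
    // sentinel; the containing line is the entry just before it.
    auto it = std::upper_bound(lineStartOffsets.begin(),
                               lineStartOffsets.end() - 1, offset);
    return uint32_t(it - lineStartOffsets.begin()) - 1;
  }

  // Line number as reported to users: index plus the initial line number.
  uint32_t lineNumber(uint32_t offset) const {
    return indexFromOffset(offset) + initialLineNum;
  }

  // Offset into the line, in code units.
  uint32_t offsetInLine(uint32_t offset) const {
    return offset - lineStartOffsets[indexFromOffset(offset)];
  }
};
```

// For the example script, offset 14 is the start of the empty line 3 and
// offset 18 falls three code units into line 4.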
    508 
    509 enum class UnitsType : unsigned char {
    510  PossiblyMultiUnit = 0,
    511  GuaranteedSingleUnit = 1,
    512 };
    513 
    514 class ChunkInfo {
    515 private:
    516  // Column number offset in UTF-16 code units.
    517  // Store everything in |unsigned char|s so everything packs.
    518  unsigned char columnOffset_[sizeof(uint32_t)];
    519  unsigned char unitsType_;
    520 
    521 public:
    522  ChunkInfo(JS::ColumnNumberUnsignedOffset offset, UnitsType type)
    523      : unitsType_(static_cast<unsigned char>(type)) {
    524    memcpy(columnOffset_, offset.addressOfValueForTranscode(), sizeof(offset));
    525  }
    526 
    527  JS::ColumnNumberUnsignedOffset columnOffset() const {
    528    JS::ColumnNumberUnsignedOffset offset;
    529    memcpy(offset.addressOfValueForTranscode(), columnOffset_,
    530           sizeof(uint32_t));
    531    return offset;
    532  }
    533 
    534  UnitsType unitsType() const {
    535    MOZ_ASSERT(unitsType_ <= 1, "unitsType_ must be 0 or 1");
    536    return static_cast<UnitsType>(unitsType_);
    537  }
    538 
    539  void guaranteeSingleUnits() {
    540    MOZ_ASSERT(unitsType() == UnitsType::PossiblyMultiUnit,
    541               "should only be setting to possibly optimize from the "
    542               "pessimistic case");
    543    unitsType_ = static_cast<unsigned char>(UnitsType::GuaranteedSingleUnit);
    544  }
    545 };
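// A minimal sketch of the packing trick ChunkInfo uses: keeping a 32-bit value
// in an |unsigned char| array gives the class an alignment of 1, so no padding
// follows the one-byte tag.  PackedSketch and UnpackedSketch are hypothetical
// names, and the size checks assume a mainstream ABI.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

class PackedSketch {
  unsigned char value_[sizeof(uint32_t)];  // raw bytes of a uint32_t
  unsigned char tag_;

 public:
  PackedSketch(uint32_t value, unsigned char tag) : tag_(tag) {
    // memcpy avoids any aliasing or alignment assumptions.
    std::memcpy(value_, &value, sizeof(value));
  }

  uint32_t value() const {
    uint32_t v;
    std::memcpy(&v, value_, sizeof(v));
    return v;
  }

  unsigned char tag() const { return tag_; }
};

// The straightforward layout, shown for comparison only.
struct UnpackedSketch {
  uint32_t value;
  unsigned char tag;
};

static_assert(sizeof(PackedSketch) == 5, "four value bytes plus one tag byte");
static_assert(alignof(PackedSketch) == 1, "byte members need no alignment");
static_assert(sizeof(UnpackedSketch) >= sizeof(PackedSketch),
              "the naive layout is never smaller");
```

// On common platforms UnpackedSketch is 8 bytes because of trailing padding,
// which adds up when many such records are stored in a vector.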
    546 
    547 enum class InvalidEscapeType {
    548  // No invalid character escapes.
    549  None,
    550  // A malformed \x escape.
    551  Hexadecimal,
    552  // A malformed \u escape.
    553  Unicode,
    554  // An otherwise well-formed \u escape which represents a
    555  // code point > U+10FFFF.
    556  UnicodeOverflow,
    557  // An octal escape in a template token.
    558  Octal,
    559  // NonOctalDecimalEscape - \8 or \9.
    560  EightOrNine
    561 };
    562 
    563 class TokenStreamAnyChars : public TokenStreamShared {
    564 private:
    565  // Constant-at-construction fields.
    566 
    567  FrontendContext* const fc;
    568 
    569  /** Options used for parsing/tokenizing. */
    570  const JS::ReadOnlyCompileOptions& options_;
    571 
    572  /**
    573   * Pointer used internally to test whether in strict mode.  Use |strictMode()|
    574   * instead of this field.
    575   */
    576  StrictModeGetter* const strictModeGetter_;
    577 
    578  /** Input filename or null. */
    579  JS::ConstUTF8CharsZ filename_;
    580 
    581  // Column number computation fields.
    582  // Used only for UTF-8 case.
    583 
    584  /**
    585   * A map of (line number => sequence of the column numbers at
    586   * |ColumnChunkLength|-unit boundaries rewound [if needed] to the nearest code
    587   * point boundary).  (|TokenStreamAnyChars::computeColumnOffset| is the sole
    588   * user of |ColumnChunkLength| and therefore contains its definition.)
    589   *
    590   * Entries appear in this map only when a column computation of sufficient
    591   * distance is performed on a line -- and only when the column is beyond the
    592   * first |ColumnChunkLength| units.  Each line's vector is lazily filled as
    593   * greater offsets require column computations.
    594   */
    595  mutable HashMap<uint32_t, Vector<ChunkInfo>> longLineColumnInfo_;
    596 
    597  // Computing accurate column numbers requires at *some* point linearly
    598  // iterating through prior source units in the line, to properly account for
    599  // multi-unit code points.  This is quadratic if counting happens repeatedly.
    600  //
    601  // But usually we need columns for advancing offsets through scripts.  By
    602  // caching the last ((line number, offset) => relative column) mapping (in
    603  // similar manner to how |SourceCoords::lastIndex_| is used to cache
    604  // (offset => line number) mappings) we can usually avoid re-iterating through
    605  // the common line prefix.
    606  //
    607  // Additionally, we avoid hash table lookup costs by caching the
    608  // |Vector<ChunkInfo>*| for the line of the last lookup.  (|nullptr| means we
    609  // must look it up -- or it hasn't been created yet.)  This pointer is nulled
    610  // when a lookup on a new line occurs, but as it's not a pointer at literal,
    611  // reallocatable element data, it's *not* invalidated when new entries are
    612  // added to such a vector.
    613 
    614  /**
    615   * The line in which the last column computation occurred, or UINT32_MAX if
    616   * no prior computation has yet happened.
    617   */
    618  mutable uint32_t lineOfLastColumnComputation_ = UINT32_MAX;
    619 
    620  /**
    621   * The chunk vector of the line for that last column computation.  This is
    622   * null if the chunk vector needs to be recalculated or initially created.
    623   */
    624  mutable Vector<ChunkInfo>* lastChunkVectorForLine_ = nullptr;
    625 
    626  /**
    627   * The offset (in code units) of the last column computation performed,
    628   * relative to source start.
    629   */
    630  mutable uint32_t lastOffsetOfComputedColumn_ = UINT32_MAX;
    631 
    632  /**
    633   * The column number offset from the 1st column for the offset (in code units)
    634   * of the last column computation performed, relative to source start.
    635   */
    636  mutable JS::ColumnNumberUnsignedOffset lastComputedColumnOffset_;
    637 
    638  // Intra-token fields.
    639 
    640  /**
    641   * The offset of the first invalid escape in a template literal.  (If there is
    642   * one -- if not, the value of this field is meaningless.)
    643   *
    644   * See also |invalidTemplateEscapeType|.
    645   */
    646  uint32_t invalidTemplateEscapeOffset = 0;
    647 
    648  /**
    649   * The type of the first invalid escape in a template literal.  (If there
    650   * isn't one, this will be |None|.)
    651   *
    652   * See also |invalidTemplateEscapeOffset|.
    653   */
    654  InvalidEscapeType invalidTemplateEscapeType = InvalidEscapeType::None;
    655 
    656  // Fields with values relevant across tokens (and therefore potentially across
    657  // function boundaries, such that lazy function parsing and stream-seeking
    658  // must take care in saving and restoring them).
    659 
    660  /** Line number and offset-to-line mapping information. */
    661  SourceCoords srcCoords;
    662 
    663  /** Circular token buffer of gotten tokens that have been ungotten. */
    664  Token tokens[ntokens] = {};
    665 
    666  /** The index in |tokens| of the last parsed token. */
    667  unsigned cursor_ = 0;
    668 
    669  /** The number of tokens in |tokens| available to be gotten. */
    670  unsigned lookahead = 0;
    671 
    672  /** The current line number. */
    673  unsigned lineno;
    674 
    675  /** Various flag bits (see above). */
    676  TokenStreamFlags flags = {};
    677 
    678  /** The offset of the start of the current line. */
    679  size_t linebase = 0;
    680 
    681  /** The start of the previous line, or |size_t(-1)| on the first line. */
    682  size_t prevLinebase = size_t(-1);
    683 
    684  /** The user's requested source URL.  Null if none has been set. */
    685  UniqueTwoByteChars displayURL_ = nullptr;
    686 
    687  /** The URL of the source map for this script.  Null if none has been set. */
    688  UniqueTwoByteChars sourceMapURL_ = nullptr;
    689 
    690  // Assorted boolean fields, none of which require maintenance across tokens,
    691  // stored at class end to minimize padding.
    692 
    693  /**
    694   * Whether syntax errors should or should not contain details about the
    695   * precise nature of the error.  (This is intended for use in suppressing
    696   * content-revealing details about syntax errors in cross-origin scripts on
    697   * the web.)
    698   */
    699  const bool mutedErrors;
    700 
    701  /**
    702   * An array storing whether a TokenKind observed while attempting to extend
    703   * a valid AssignmentExpression into an even longer AssignmentExpression
    704   * (e.g., extending '3' to '3 + 5') will terminate it without error.
    705   *
    706   * For example, ';' always ends an AssignmentExpression because it ends a
    707   * Statement or declaration.  '}' always ends an AssignmentExpression
    708   * because it terminates BlockStatement, FunctionBody, and embedded
    709   * expressions in TemplateLiterals.  Therefore both entries are set to true
    710   * in TokenStreamAnyChars construction.
    711   *
    712   * But e.g. '+' *could* extend an AssignmentExpression, so its entry here
    713   * is false.  Meanwhile 'this' can't extend an AssignmentExpression, but
    714   * it's only valid after a line break, so its entry here must be false.
    715   *
    716   * NOTE: This array could be static, but without C99's designated
    717   *       initializers it's easier zeroing here and setting the true entries
    718   *       in the constructor body.  (Having this per-instance might also aid
    719   *       locality.)  Don't worry!  Initialization time for each TokenStream
    720   *       is trivial.  See bug 639420.
    721   */
    722  bool isExprEnding[size_t(TokenKind::Limit)] = {};  // all-false initially
    723 
    724  // End of fields.
    725 
    726 public:
    727  TokenStreamAnyChars(FrontendContext* fc,
    728                      const JS::ReadOnlyCompileOptions& options,
    729                      StrictModeGetter* smg);
    730 
    731  template <typename Unit, class AnyCharsAccess>
    732  friend class GeneralTokenStreamChars;
    733  template <typename Unit, class AnyCharsAccess>
    734  friend class TokenStreamChars;
    735  template <typename Unit, class AnyCharsAccess>
    736  friend class TokenStreamSpecific;
    737 
    738  template <typename Unit>
    739  friend class TokenStreamPosition;
    740 
    741  // Accessors.
    742  unsigned cursor() const { return cursor_; }
    743  unsigned nextCursor() const { return (cursor_ + 1) & ntokensMask; }
    744  unsigned aheadCursor(unsigned steps) const {
    745    return (cursor_ + steps) & ntokensMask;
    746  }
    747 
    748  const Token& currentToken() const { return tokens[cursor()]; }
    749  bool isCurrentTokenType(TokenKind type) const {
    750    return currentToken().type == type;
    751  }
    752 
    753  [[nodiscard]] bool checkOptions();
    754 
    755 private:
    756  TaggedParserAtomIndex reservedWordToPropertyName(TokenKind tt) const;
    757 
    758 public:
    759  TaggedParserAtomIndex currentName() const {
    760    if (isCurrentTokenType(TokenKind::Name) ||
    761        isCurrentTokenType(TokenKind::PrivateName)) {
    762      return currentToken().name();
    763    }
    764 
    765    MOZ_ASSERT(TokenKindIsPossibleIdentifierName(currentToken().type));
    766    return reservedWordToPropertyName(currentToken().type);
    767  }
    768 
    769  bool currentNameHasEscapes(ParserAtomsTable& parserAtoms) const {
    770    if (isCurrentTokenType(TokenKind::Name) ||
    771        isCurrentTokenType(TokenKind::PrivateName)) {
    772      TokenPos pos = currentToken().pos;
    773      return (pos.end - pos.begin) != parserAtoms.length(currentToken().name());
    774    }
    775 
    776    MOZ_ASSERT(TokenKindIsPossibleIdentifierName(currentToken().type));
    777    return false;
    778  }
    779 
    780  bool isCurrentTokenAssignment() const {
    781    return TokenKindIsAssignment(currentToken().type);
    782  }
    783 
    784  // Flag methods.
    785  bool isEOF() const { return flags.isEOF; }
    786  bool hadError() const { return flags.hadError; }
    787 
    788  DeprecatedContent sawDeprecatedContent() const {
    789    return static_cast<DeprecatedContent>(flags.sawDeprecatedContent);
    790  }
    791 
    792 private:
    793  // Workaround GCC 7 sadness.
    794  void setSawDeprecatedContent(DeprecatedContent content) {
    795    flags.sawDeprecatedContent = static_cast<uint8_t>(content);
    796  }
    797 
    798 public:
    799  void clearSawDeprecatedContent() {
    800    setSawDeprecatedContent(DeprecatedContent::None);
    801  }
    802  void setSawDeprecatedOctalLiteral() {
    803    setSawDeprecatedContent(DeprecatedContent::OctalLiteral);
    804  }
    805  void setSawDeprecatedOctalEscape() {
    806    setSawDeprecatedContent(DeprecatedContent::OctalEscape);
    807  }
    808  void setSawDeprecatedEightOrNineEscape() {
    809    setSawDeprecatedContent(DeprecatedContent::EightOrNineEscape);
    810  }
    811 
    812  bool hasInvalidTemplateEscape() const {
    813    return invalidTemplateEscapeType != InvalidEscapeType::None;
    814  }
    815  void clearInvalidTemplateEscape() {
    816    invalidTemplateEscapeType = InvalidEscapeType::None;
    817  }
    818 
    819 private:
    820  // This is private because it should only be called by the tokenizer while
    821  // tokenizing, not by, for example, BytecodeEmitter.
    822  bool strictMode() const {
    823    return strictModeGetter_ && strictModeGetter_->strictMode();
    824  }
    825 
    826  void setInvalidTemplateEscape(uint32_t offset, InvalidEscapeType type) {
    827    MOZ_ASSERT(type != InvalidEscapeType::None);
    828    if (invalidTemplateEscapeType != InvalidEscapeType::None) {
    829      return;
    830    }
    831    invalidTemplateEscapeOffset = offset;
    832    invalidTemplateEscapeType = type;
    833  }
    834 
    835 public:
    836  // Call this immediately after parsing an OrExpression to allow scanning the
    837  // next token with SlashIsRegExp without asserting (even though we just
    838  // peeked at it in SlashIsDiv mode).
    839  //
    840  // It's OK to disable the assertion because the places where this is called
    841  // have peeked at the next token in SlashIsDiv mode, and checked that it is
    842  // *not* a Div token.
    843  //
    844  // To see why it is necessary to disable the assertion, consider these two
    845  // programs:
    846  //
    847  //     x = arg => q       // per spec, this is all one statement, and the
    848  //     /a/g;              // slashes are division operators
    849  //
    850  //     x = arg => {}      // per spec, ASI at the end of this line
    851  //     /a/g;              // and that's a regexp literal
    852  //
    853  // The first program shows why orExpr() has to use SlashIsDiv mode when
    854  // peeking ahead for the next operator after parsing `q`. The second program
    855  // shows why matchOrInsertSemicolon() must use SlashIsRegExp mode when
    856  // scanning ahead for a semicolon.
    857  void allowGettingNextTokenWithSlashIsRegExp() {
    858 #ifdef DEBUG
    859    // Check the precondition: Caller already peeked ahead at the next token,
    860    // in SlashIsDiv mode, and it is *not* a Div token.
    861    MOZ_ASSERT(hasLookahead());
    862    const Token& next = nextToken();
    863    MOZ_ASSERT(next.modifier == SlashIsDiv);
    864    MOZ_ASSERT(next.type != TokenKind::Div);
    865    tokens[nextCursor()].modifier = SlashIsRegExp;
    866 #endif
    867  }
    868 
    869 #ifdef DEBUG
    870  inline bool debugHasNoLookahead() const { return lookahead == 0; }
    871 #endif
    872 
    873  bool hasDisplayURL() const { return displayURL_ != nullptr; }
    874 
    875  char16_t* displayURL() { return displayURL_.get(); }
    876 
    877  bool hasSourceMapURL() const { return sourceMapURL_ != nullptr; }
    878 
    879  char16_t* sourceMapURL() { return sourceMapURL_.get(); }
    880 
    881  FrontendContext* context() const { return fc; }
    882 
    883  using LineToken = SourceCoords::LineToken;
    884 
    885  LineToken lineToken(uint32_t offset) const {
    886    return srcCoords.lineToken(offset);
    887  }
    888 
    889  uint32_t lineNumber(LineToken lineToken) const {
    890    return srcCoords.lineNumber(lineToken);
    891  }
    892 
    893  uint32_t lineStart(LineToken lineToken) const {
    894    return srcCoords.lineStart(lineToken);
    895  }
    896 
    897  /**
    898   * Fill in |err|.
    899   *
    900   * If the token stream doesn't have location info for this error, use the
    901   * caller's location (including line/column number) and return false.  (No
    902   * line of context is set.)
    903   *
    904   * Otherwise fill in everything in |err| except 1) line/column numbers and
    905   * 2) line-of-context-related fields and return true.  The caller *must*
    906   * fill in the line/column number; filling the line of context is optional.
    907   */
    908  bool fillExceptingContext(ErrorMetadata* err, uint32_t offset) const;
    909 
    910  MOZ_ALWAYS_INLINE void updateFlagsForEOL() { flags.isDirtyLine = false; }
    911 
    912 private:
    913  /**
    914   * Compute the column number offset from the 1st code unit in the line in
    915   * UTF-16 code units, for given absolute |offset| within source text on the
    916   * line of |lineToken| (which must have been computed from |offset|).
    917   *
    918   * A column number offset on a line that isn't the first line is just
    919   * the actual column number in 0-origin.  But a column number offset
    920   * on the first line is the column number offset from the initial
    921   * line/column of the script.  For example, consider this HTML with
    922   * line/column number keys:
    923   *
    924   *     Column number in 1-origin
    925   *                1         2            3
    926   *       123456789012345678901234   567890
    927   *
    928   *     Column number in 0-origin, and the offset from 1st column
    929   *                 1         2            3
    930   *       0123456789012345678901234   567890
    931   *     ------------------------------------
    932   *   1 | <html>
    933   *   2 | <head>
    934   *   3 |   <script>var x = 3;  x &lt; 4;
    935   *   4 | const y = 7;</script>
    936   *   5 | </head>
    937   *   6 | <body></body>
    938   *   7 | </html>
    939   *
    940   * The script would be compiled specifying initial (line, column) of (3, 10)
    941   * using |JS::ReadOnlyCompileOptions::{lineno,column}|, which is 0-origin.
    942   * And the column reported by |computeColumn| for the "v" of |var| would be
    943   * 11 (in 1-origin).  But the column number offset of the "v" in |var|, that
    944   * this function returns, would be 0.  On the other hand, the column reported
    945   * by |computeColumn| would be 1 (in 1-origin) and the column number offset
    946   * returned by this function for the "c" in |const| would be 0, because it's
    947   * not in the first line of source text.
    948   *
    949   * The column number offset is with respect *only* to the JavaScript source
    950   * text as SpiderMonkey sees it.  In the example, the "&lt;" is converted to
    951   * "<" by the browser before SpiderMonkey would see it.  So the column number
    952   * offset of the "4" in the inequality would be 16, not 19.
    953   *
    954   * UTF-16 code units are not all equal length in UTF-8 source, so counting
    955   * requires *some* kind of linear-time counting from the start of the line.
    956   * This function attempts various tricks to reduce this cost.  If these
    957   * optimizations succeed, repeated calls to this function on a line will pay
    958   * a one-time cost linear in the length of the line, then each call pays a
    959   * separate constant-time cost.  If the optimizations do not succeed, this
    960   * function works in time linear in the length of the line.
    961   *
    962   * It's unusual for a function in *this* class to be |Unit|-templated, but
    963   * while this operation manages |Unit|-agnostic fields in this class and in
    964   * |srcCoords|, it must *perform* |Unit|-sensitive computations to fill them.
    965   * And this is the best place to do that.
    966   */
    967  template <typename Unit>
    968  JS::ColumnNumberUnsignedOffset computeColumnOffset(
    969      const LineToken lineToken, const uint32_t offset,
    970      const SourceUnits<Unit>& sourceUnits) const;
    971 
    972  template <typename Unit>
    973  JS::ColumnNumberUnsignedOffset computeColumnOffsetForUTF8(
    974      const LineToken lineToken, const uint32_t offset, const uint32_t start,
    975      const uint32_t offsetInLine, const SourceUnits<Unit>& sourceUnits) const;
    976 
    977  /**
    978   * Update line/column information for the start of a new line at
    979   * |lineStartOffset|.
    980   */
    981  [[nodiscard]] MOZ_ALWAYS_INLINE bool internalUpdateLineInfoForEOL(
    982      uint32_t lineStartOffset);
    983 
    984 public:
    985  const Token& nextToken() const {
    986    MOZ_ASSERT(hasLookahead());
    987    return tokens[nextCursor()];
    988  }
    989 
    990  bool hasLookahead() const { return lookahead > 0; }
    991 
    992  void advanceCursor() { cursor_ = (cursor_ + 1) & ntokensMask; }
    993 
    994  void retractCursor() { cursor_ = (cursor_ - 1) & ntokensMask; }
    995 
    996  Token* allocateToken() {
    997    advanceCursor();
    998 
    999    Token* tp = &tokens[cursor()];
   1000    MOZ_MAKE_MEM_UNDEFINED(tp, sizeof(*tp));
   1001 
   1002    return tp;
   1003  }
   1004 
   1005  // Push the last scanned token back into the stream.
   1006  void ungetToken() {
   1007    MOZ_ASSERT(lookahead < maxLookahead);
   1008    lookahead++;
   1009    retractCursor();
   1010  }
   1011 
   1012 public:
   1013  void adoptState(TokenStreamAnyChars& other) {
   1014    // If |other| has fresh information from directives, overwrite any
   1015    // previously recorded directives.  (There is no specification directing
   1016    // that last-in-source-order directive controls, sadly.  We behave this way
   1017    // in the ordinary case, so we ought to do so here too.)
   1018    if (auto& url = other.displayURL_) {
   1019      displayURL_ = std::move(url);
   1020    }
   1021    if (auto& url = other.sourceMapURL_) {
   1022      sourceMapURL_ = std::move(url);
   1023    }
   1024  }
   1025 
   1026  // Compute error metadata for an error at no offset.
   1027  void computeErrorMetadataNoOffset(ErrorMetadata* err) const;
   1028 
   1029  // ErrorReporter API Helpers
   1030 
   1031  // Provide minimal set of error reporting API given we cannot use
   1032  // ErrorReportMixin here. "report" prefix is added to avoid conflict with
   1033  // ErrorReportMixin methods in TokenStream class.
   1034  void reportErrorNoOffset(unsigned errorNumber, ...) const;
   1035  void reportErrorNoOffsetVA(unsigned errorNumber, va_list* args) const;
   1036 
   1037  const JS::ReadOnlyCompileOptions& options() const { return options_; }
   1038 
   1039  JS::ConstUTF8CharsZ getFilename() const { return filename_; }
   1040 };
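
// Illustration only (not part of the upstream header): a minimal sketch of the
// masked circular-buffer arithmetic behind |advanceCursor|, |retractCursor|,
// and |ungetToken| above.  The constants (4 slots, lookahead limit of 2), the
// |MiniTokenRing| type, and its |push|/|reget| helpers are assumptions made
// for this example; the real constants are defined outside this section.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-ins for ntokens, ntokensMask, and maxLookahead.
constexpr unsigned kNumTokens = 4;
constexpr unsigned kTokensMask = kNumTokens - 1;
constexpr unsigned kMaxLookahead = 2;
static_assert((kNumTokens & kTokensMask) == 0, "size must be a power of two");

struct MiniTokenRing {
  int tokens[kNumTokens] = {};
  unsigned cursor = 0;     // index of the last token handed out
  unsigned lookahead = 0;  // number of ungotten tokens ahead of the cursor

  unsigned nextCursor() const { return (cursor + 1) & kTokensMask; }
  void advanceCursor() { cursor = (cursor + 1) & kTokensMask; }
  // Unsigned subtraction wraps around; the mask folds it back into range.
  void retractCursor() { cursor = (cursor - 1) & kTokensMask; }

  int current() const { return tokens[cursor]; }

  // Store a freshly scanned token in the next slot.
  void push(int value) {
    advanceCursor();
    tokens[cursor] = value;
  }

  // Push the last token back; it stays in the buffer and becomes lookahead.
  void unget() {
    assert(lookahead < kMaxLookahead);
    lookahead++;
    retractCursor();
  }

  // Hand out an ungotten token again without rescanning.
  int reget() {
    assert(lookahead > 0);
    lookahead--;
    advanceCursor();
    return tokens[cursor];
  }
};

}  // namespace sketch
```

// After push(10), push(20), unget(), the sketch's current token is 10 again
// and 20 is available as lookahead, so nothing has to be rescanned.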
   1041 
   1042 constexpr char16_t CodeUnitValue(char16_t unit) { return unit; }
   1043 
   1044 constexpr uint8_t CodeUnitValue(mozilla::Utf8Unit unit) {
   1045  return unit.toUint8();
   1046 }
   1047 
   1048 template <typename Unit>
   1049 class TokenStreamCharsBase;
   1050 
   1051 template <typename T>
   1052 inline bool IsLineTerminator(T) = delete;
   1053 
   1054 inline bool IsLineTerminator(char32_t codePoint) {
   1055  return codePoint == '\n' || codePoint == '\r' ||
   1056         codePoint == unicode::LINE_SEPARATOR ||
   1057         codePoint == unicode::PARA_SEPARATOR;
   1058 }
   1059 
   1060 inline bool IsLineTerminator(char16_t unit) {
   1061  // Every LineTerminator fits in char16_t, so this is exact.
   1062  return IsLineTerminator(static_cast<char32_t>(unit));
   1063 }
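
// Illustration only (not part of the upstream header): a sketch of why the
// deleted |IsLineTerminator(T)| template above is useful.  It wins overload
// resolution for any argument type without an exact overload, so a call with
// |char| or |int| fails to compile instead of silently converting.  The
// |sketch| namespace and the |CanCall| detection trait are assumptions made
// for this example.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace sketch {

template <typename T>
inline bool IsLineTerminator(T) = delete;

inline bool IsLineTerminator(char32_t codePoint) {
  return codePoint == U'\n' || codePoint == U'\r' ||
         codePoint == 0x2028 ||  // LINE SEPARATOR
         codePoint == 0x2029;    // PARAGRAPH SEPARATOR
}

inline bool IsLineTerminator(char16_t unit) {
  return IsLineTerminator(static_cast<char32_t>(unit));
}

// Hypothetical detection trait: true when IsLineTerminator(T) is callable.
template <typename T, typename = void>
struct CanCall : std::false_type {};

template <typename T>
struct CanCall<T, decltype(void(IsLineTerminator(std::declval<T>())))>
    : std::true_type {};

static_assert(CanCall<char32_t>::value, "exact overload exists");
static_assert(CanCall<char16_t>::value, "exact overload exists");
static_assert(!CanCall<char>::value, "char selects the deleted template");
static_assert(!CanCall<int>::value, "int selects the deleted template");

}  // namespace sketch
```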
   1064 
   1065 template <typename Unit>
   1066 struct SourceUnitTraits;
   1067 
   1068 template <>
   1069 struct SourceUnitTraits<char16_t> {
   1070 public:
   1071  static constexpr uint8_t maxUnitsLength = 2;
   1072 
   1073  static constexpr size_t lengthInUnits(char32_t codePoint) {
   1074    return codePoint < unicode::NonBMPMin ? 1 : 2;
   1075  }
   1076 };
   1077 
   1078 template <>
   1079 struct SourceUnitTraits<mozilla::Utf8Unit> {
   1080 public:
   1081  static constexpr uint8_t maxUnitsLength = 4;
   1082 
   1083  static constexpr size_t lengthInUnits(char32_t codePoint) {
   1084    return codePoint < 0x80      ? 1
   1085           : codePoint < 0x800   ? 2
   1086           : codePoint < 0x10000 ? 3
   1087                                 : 4;
   1088  }
   1089 };
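
// Illustration only (not part of the upstream header): the two
// |lengthInUnits| computations above, restated as free functions and checked
// at compile time for a few code points.  The function names are assumptions
// made for this example, and 0x10000 is written out in place of
// |unicode::NonBMPMin|.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// UTF-16: one unit inside the BMP, a surrogate pair outside it.
constexpr std::size_t Utf16LengthInUnits(char32_t codePoint) {
  return codePoint < 0x10000 ? 1 : 2;
}

// UTF-8: one to four units, depending on the code point's magnitude.
constexpr std::size_t Utf8LengthInUnits(char32_t codePoint) {
  return codePoint < 0x80      ? 1
         : codePoint < 0x800   ? 2
         : codePoint < 0x10000 ? 3
                               : 4;
}

static_assert(Utf16LengthInUnits(U'A') == 1, "ASCII");
static_assert(Utf8LengthInUnits(U'A') == 1, "ASCII");
static_assert(Utf16LengthInUnits(0x03C0) == 1, "GREEK SMALL LETTER PI");
static_assert(Utf8LengthInUnits(0x03C0) == 2, "GREEK SMALL LETTER PI");
static_assert(Utf16LengthInUnits(0x2028) == 1, "LINE SEPARATOR");
static_assert(Utf8LengthInUnits(0x2028) == 3, "LINE SEPARATOR");
static_assert(Utf16LengthInUnits(0x1F600) == 2, "non-BMP code point");
static_assert(Utf8LengthInUnits(0x1F600) == 4, "non-BMP code point");

}  // namespace sketch
```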
   1090 
   1091 /**
   1092 * PeekedCodePoint represents the result of peeking ahead in some source text
   1093 * to determine the next validly-encoded code point.
   1094 *
   1095 * If there isn't a valid code point, then |isNone()|.
   1096 *
   1097 * But if there *is* a valid code point, then |!isNone()|, the code point has
   1098 * value |codePoint()| and its length in code units is |lengthInUnits()|.
   1099 *
   1100 * Conceptually, this class is |Maybe<struct { char32_t v; uint8_t len; }>|.
   1101 */
   1102 template <typename Unit>
   1103 class PeekedCodePoint final {
   1104  char32_t codePoint_ = 0;
   1105  uint8_t lengthInUnits_ = 0;
   1106 
   1107 private:
   1108  using SourceUnitTraits = frontend::SourceUnitTraits<Unit>;
   1109 
   1110  PeekedCodePoint() = default;
   1111 
   1112 public:
   1113  /**
   1114   * Create a peeked code point with the given value and length in code
   1115   * units.
   1116   *
   1117   * While the latter value is computable from the former for both UTF-8 and
   1118   * JS's version of UTF-16, the caller likely computed a length in units in
   1119   * the course of determining the peeked value.  Passing both here avoids
   1120   * recomputation and lets us do a consistency-checking assertion.
   1121   */
   1122  PeekedCodePoint(char32_t codePoint, uint8_t lengthInUnits)
   1123      : codePoint_(codePoint), lengthInUnits_(lengthInUnits) {
   1124    MOZ_ASSERT(codePoint <= unicode::NonBMPMax);
   1125    MOZ_ASSERT(lengthInUnits != 0, "bad code point length");
   1126    MOZ_ASSERT(lengthInUnits == SourceUnitTraits::lengthInUnits(codePoint));
   1127  }
   1128 
   1129  /** Create a PeekedCodePoint that represents no valid code point. */
   1130  static PeekedCodePoint none() { return PeekedCodePoint(); }
   1131 
   1132  /** True if no code point was found, false otherwise. */
   1133  bool isNone() const { return lengthInUnits_ == 0; }
   1134 
   1135  /** If a code point was found, its value. */
   1136  char32_t codePoint() const {
   1137    MOZ_ASSERT(!isNone());
   1138    return codePoint_;
   1139  }
   1140 
   1141  /** If a code point was found, its length in code units. */
   1142  uint8_t lengthInUnits() const {
   1143    MOZ_ASSERT(!isNone());
   1144    return lengthInUnits_;
   1145  }
   1146 };
   1147 
   1148 inline PeekedCodePoint<char16_t> PeekCodePoint(const char16_t* const ptr,
   1149                                               const char16_t* const end) {
   1150  if (MOZ_UNLIKELY(ptr >= end)) {
   1151    return PeekedCodePoint<char16_t>::none();
   1152  }
   1153 
   1154  char16_t lead = ptr[0];
   1155 
   1156  char32_t c;
   1157  uint8_t len;
   1158  if (MOZ_LIKELY(!unicode::IsLeadSurrogate(lead)) ||
   1159      MOZ_UNLIKELY(ptr + 1 >= end || !unicode::IsTrailSurrogate(ptr[1]))) {
   1160    c = lead;
   1161    len = 1;
   1162  } else {
   1163    c = unicode::UTF16Decode(lead, ptr[1]);
   1164    len = 2;
   1165  }
   1166 
   1167  return PeekedCodePoint<char16_t>(c, len);
   1168 }
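
// Illustration only (not part of the upstream header): a standalone sketch of
// the UTF-16 peeking logic above, using only the standard library.  A lead
// surrogate followed by a trail surrogate decodes to one non-BMP code point of
// length 2; anything else, including a lone surrogate, is returned as-is with
// length 1.  The |Peeked| struct and the helper names are assumptions made for
// this example; length 0 plays the role of |isNone()|.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

struct Peeked {
  char32_t codePoint;
  std::uint8_t lengthInUnits;  // 0 means "no code point"
};

inline bool IsLeadSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool IsTrailSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

inline Peeked PeekUtf16(const char16_t* ptr, const char16_t* end) {
  if (ptr >= end) {
    return {0, 0};
  }

  const char16_t lead = ptr[0];
  if (!IsLeadSurrogate(lead) || ptr + 1 >= end || !IsTrailSurrogate(ptr[1])) {
    // BMP code unit or lone surrogate: JS treats it as its own code point.
    return {lead, 1};
  }

  // Standard surrogate-pair decoding.
  const char32_t c = 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
                     (char32_t(ptr[1]) - 0xDC00);
  return {c, 2};
}

}  // namespace sketch
```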
   1169 
   1170 inline PeekedCodePoint<mozilla::Utf8Unit> PeekCodePoint(
   1171    const mozilla::Utf8Unit* const ptr, const mozilla::Utf8Unit* const end) {
   1172  if (MOZ_UNLIKELY(ptr >= end)) {
   1173    return PeekedCodePoint<mozilla::Utf8Unit>::none();
   1174  }
   1175 
   1176  const mozilla::Utf8Unit lead = ptr[0];
   1177  if (mozilla::IsAscii(lead)) {
   1178    return PeekedCodePoint<mozilla::Utf8Unit>(lead.toUint8(), 1);
   1179  }
   1180 
   1181  const mozilla::Utf8Unit* afterLead = ptr + 1;
   1182  mozilla::Maybe<char32_t> codePoint =
   1183      mozilla::DecodeOneUtf8CodePoint(lead, &afterLead, end);
   1184  if (codePoint.isNothing()) {
   1185    return PeekedCodePoint<mozilla::Utf8Unit>::none();
   1186  }
   1187 
   1188  auto len =
   1189      mozilla::AssertedCast<uint8_t>(mozilla::PointerRangeSize(ptr, afterLead));
   1190  MOZ_ASSERT(len <= 4);
   1191 
   1192  return PeekedCodePoint<mozilla::Utf8Unit>(codePoint.value(), len);
   1193 }
   1194 
   1195 inline bool IsSingleUnitLineTerminator(mozilla::Utf8Unit unit) {
   1196  // BEWARE: The Unicode line/paragraph separators don't fit in a single
   1197  //         UTF-8 code unit, so this test is exact for Utf8Unit but inexact
   1198  //         for UTF-8 as a whole.  Users must handle |unit| as start of a
   1199  //         Unicode LineTerminator themselves!
   1200  return unit == mozilla::Utf8Unit('\n') || unit == mozilla::Utf8Unit('\r');
   1201 }
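
// Illustration only (not part of the upstream header): a sketch of the caveat
// above.  In UTF-8, U+2028 and U+2029 are the three-unit sequences
// 0xE2 0x80 0xA8 and 0xE2 0x80 0xA9, so a single-unit test cannot see them and
// the caller has to look at the following units as well.  The function names
// and the use of |unsigned char| in place of |mozilla::Utf8Unit| are
// assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

inline bool IsSingleUnitLineTerminator(unsigned char unit) {
  return unit == '\n' || unit == '\r';
}

// Length in code units of the LineTerminator starting at |ptr|, or 0 if the
// text at |ptr| does not start with one.
inline std::size_t LineTerminatorLengthUtf8(const unsigned char* ptr,
                                            const unsigned char* end) {
  if (ptr >= end) {
    return 0;
  }
  if (IsSingleUnitLineTerminator(*ptr)) {
    return 1;
  }
  if (end - ptr >= 3 && ptr[0] == 0xE2 && ptr[1] == 0x80 &&
      (ptr[2] == 0xA8 || ptr[2] == 0xA9)) {
    return 3;  // U+2028 LINE SEPARATOR or U+2029 PARAGRAPH SEPARATOR
  }
  return 0;
}

}  // namespace sketch
```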
   1202 
   1203 // This is the low-level interface to the JS source code buffer.  It just gets
   1204 // raw Unicode code units -- 16-bit char16_t units of source text that are not
   1205 // (always) full code points, and 8-bit units of UTF-8 source text.
   1206 // TokenStream functions are layered on top and do some extra stuff like
   1207 // converting all EOL sequences to '\n', tracking the line number, and setting
   1208 // |flags.isEOF|.  (The "raw" in "raw Unicode code units" refers to the lack of
   1209 // EOL sequence normalization.)
   1210 //
   1211 // buf[0..length-1] often represents a substring of some larger source,
   1212 // where we have only the substring in memory. The |startOffset| argument
   1213 // indicates the offset within this larger string at which our string
   1214 // begins, the offset of |buf[0]|.
   1215 template <typename Unit>
   1216 class SourceUnits {
   1217 private:
   1218  /** Base of buffer. */
   1219  const Unit* base_;
   1220 
   1221  /** Offset of base_[0]. */
   1222  uint32_t startOffset_;
   1223 
   1224  /** Limit for quick bounds check. */
   1225  const Unit* limit_;
   1226 
   1227  /** Next char to get. */
   1228  const Unit* ptr;
   1229 
   1230 public:
   1231  SourceUnits(const Unit* units, size_t length, size_t startOffset)
   1232      : base_(units),
   1233        startOffset_(startOffset),
   1234        limit_(units + length),
   1235        ptr(units) {}
   1236 
   1237  bool atStart() const {
   1238    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
   1239    return ptr == base_;
   1240  }
   1241 
   1242  bool atEnd() const {
   1243    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
   1244    MOZ_ASSERT(ptr <= limit_, "shouldn't have overrun");
   1245    return ptr >= limit_;
   1246  }
   1247 
   1248  size_t remaining() const {
   1249    MOZ_ASSERT(!isPoisoned(),
   1250               "can't get a count of remaining code units if poisoned");
   1251    return mozilla::PointerRangeSize(ptr, limit_);
   1252  }
   1253 
   1254  size_t startOffset() const { return startOffset_; }
   1255 
   1256  size_t offset() const {
   1257    return startOffset_ + mozilla::PointerRangeSize(base_, ptr);
   1258  }
   1259 
   1260  const Unit* codeUnitPtrAt(size_t offset) const {
   1261    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
   1262    MOZ_ASSERT(startOffset_ <= offset);
   1263    MOZ_ASSERT(offset - startOffset_ <=
   1264               mozilla::PointerRangeSize(base_, limit_));
   1265    return base_ + (offset - startOffset_);
   1266  }
   1267 
   1268  const Unit* current() const { return ptr; }
   1269 
   1270  const Unit* limit() const { return limit_; }
   1271 
   1272  Unit previousCodeUnit() {
   1273    MOZ_ASSERT(!isPoisoned(), "can't get previous code unit if poisoned");
   1274    MOZ_ASSERT(!atStart(), "must have a previous code unit to get");
   1275    return *(ptr - 1);
   1276  }
   1277 
   1278  MOZ_ALWAYS_INLINE Unit getCodeUnit() {
   1279    return *ptr++;  // this will nullptr-crash if poisoned
   1280  }
   1281 
   1282  Unit peekCodeUnit() const {
   1283    return *ptr;  // this will nullptr-crash if poisoned
   1284  }
   1285 
   1286  /**
   1287   * Determine the next code point in source text.  The code point is not
   1288   * normalized: '\r', '\n', '\u2028', and '\u2029' are returned literally.
   1289   * If there is no next code point because |atEnd()|, or if an encoding
   1290   * error is encountered, return a |PeekedCodePoint| that |isNone()|.
   1291   *
   1292   * This function does not report errors: code that attempts to get the next
   1293   * code point must report any error.
   1294   *
   1295   * If a next code point is found, it may be consumed by passing it to
   1296   * |consumeKnownCodePoint|.
   1297   */
   1298  PeekedCodePoint<Unit> peekCodePoint() const {
   1299    return PeekCodePoint(ptr, limit_);
   1300  }
   1301 
   1302 private:
   1303 #ifdef DEBUG
   1304  void assertNextCodePoint(const PeekedCodePoint<Unit>& peeked);
   1305 #endif
   1306 
   1307 public:
   1308  /**
   1309   * Consume a peeked code point that |!isNone()|.
   1310   *
   1311   * This call DOES NOT UPDATE LINE-STATUS.  You may need to call
   1312   * |updateLineInfoForEOL()| and |updateFlagsForEOL()| if this consumes a
   1313   * LineTerminator.  Note that if this consumes '\r', you also must consume
   1314   * an optional '\n' (i.e. a full LineTerminatorSequence) before doing so.
   1315   */
   1316  void consumeKnownCodePoint(const PeekedCodePoint<Unit>& peeked) {
   1317    MOZ_ASSERT(!peeked.isNone());
   1318    MOZ_ASSERT(peeked.lengthInUnits() <= remaining());
   1319 
   1320 #ifdef DEBUG
   1321    assertNextCodePoint(peeked);
   1322 #endif
   1323 
   1324    ptr += peeked.lengthInUnits();
   1325  }
   1326 
   1327  /** Match |n| hexadecimal digits and store their value in |*out|. */
   1328  bool matchHexDigits(uint8_t n, char16_t* out) {
   1329    MOZ_ASSERT(!isPoisoned(), "shouldn't peek into poisoned SourceUnits");
   1330    MOZ_ASSERT(n <= 4, "hexdigit value can't overflow char16_t");
   1331    if (n > remaining()) {
   1332      return false;
   1333    }
   1334 
   1335    char16_t v = 0;
   1336    for (uint8_t i = 0; i < n; i++) {
   1337      auto unit = CodeUnitValue(ptr[i]);
   1338      if (!mozilla::IsAsciiHexDigit(unit)) {
   1339        return false;
   1340      }
   1341 
   1342      v = (v << 4) | mozilla::AsciiAlphanumericToNumber(unit);
   1343    }
   1344 
   1345    *out = v;
   1346    ptr += n;
   1347    return true;
   1348  }
   1349 
   1350  bool matchCodeUnits(const char* chars, uint8_t length) {
   1351    MOZ_ASSERT(!isPoisoned(), "shouldn't match into poisoned SourceUnits");
   1352    if (length > remaining()) {
   1353      return false;
   1354    }
   1355 
   1356    const Unit* start = ptr;
   1357    const Unit* end = ptr + length;
   1358    while (ptr < end) {
   1359      if (*ptr++ != Unit(*chars++)) {
   1360        ptr = start;
   1361        return false;
   1362      }
   1363    }
   1364 
   1365    return true;
   1366  }
   1367 
   1368  void skipCodeUnits(uint32_t n) {
   1369    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
   1370    MOZ_ASSERT(n <= remaining(), "shouldn't skip beyond end of SourceUnits");
   1371    ptr += n;
   1372  }
   1373 
   1374  void unskipCodeUnits(uint32_t n) {
   1375    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
   1376    MOZ_ASSERT(n <= mozilla::PointerRangeSize(base_, ptr),
   1377               "shouldn't unskip beyond start of SourceUnits");
   1378    ptr -= n;
   1379  }
   1380 
   1381 private:
   1382  friend class TokenStreamCharsBase<Unit>;
   1383 
   1384  bool internalMatchCodeUnit(Unit c) {
   1385    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
   1386    if (MOZ_LIKELY(!atEnd()) && *ptr == c) {
   1387      ptr++;
   1388      return true;
   1389    }
   1390    return false;
   1391  }
   1392 
   1393 public:
   1394  void consumeKnownCodeUnit(Unit c) {
   1395    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
   1396    MOZ_ASSERT(*ptr == c, "consuming the wrong code unit");
   1397    ptr++;
   1398  }
   1399 
   1400  /** Unget U+2028 LINE SEPARATOR or U+2029 PARAGRAPH SEPARATOR. */
   1401  inline void ungetLineOrParagraphSeparator();
   1402 
   1403  void ungetCodeUnit() {
   1404    MOZ_ASSERT(!isPoisoned(), "can't unget from poisoned units");
   1405    MOZ_ASSERT(!atStart(), "can't unget if currently at start");
   1406    ptr--;
   1407  }
   1408 
   1409  const Unit* addressOfNextCodeUnit(bool allowPoisoned = false) const {
   1410    MOZ_ASSERT_IF(!allowPoisoned, !isPoisoned());
   1411    return ptr;
   1412  }
   1413 
   1414  // Use this with caution!
   1415  void setAddressOfNextCodeUnit(const Unit* a, bool allowPoisoned = false) {
   1416    MOZ_ASSERT_IF(!allowPoisoned, a);
   1417    ptr = a;
   1418  }
   1419 
   1420  // Poison the SourceUnits so they can't be accessed again.
   1421  void poisonInDebug() {
   1422 #ifdef DEBUG
   1423    ptr = nullptr;
   1424 #endif
   1425  }
   1426 
   1427 private:
   1428  bool isPoisoned() const {
   1429 #ifdef DEBUG
   1430    // |ptr| can be null for unpoisoned SourceUnits if this was initialized with
   1431    // |units == nullptr| and |length == 0|.  In that case, for lack of any
   1432    // better options, consider this to not be poisoned.
   1433    return ptr == nullptr && ptr != limit_;
   1434 #else
   1435    return false;
   1436 #endif
   1437  }
   1438 
   1439 public:
   1440  /**
   1441   * Consume the rest of a single-line comment (but not the EOL/EOF that
   1442   * terminates it).
   1443   *
   1444   * If an encoding error is encountered -- possible only for UTF-8 because
   1445   * JavaScript's conception of UTF-16 encompasses any sequence of 16-bit
   1446   * code units -- valid code points prior to the encoding error are consumed
   1447   * and subsequent invalid code units are not consumed.  For example, given
   1448   * these UTF-8 code units:
   1449   *
   1450   *   'B'   'A'  'D'  ':'   <bad code unit sequence>
   1451   *   0x42  0x41 0x44 0x3A  0xD0 0x00 ...
   1452   *
   1453   * the first four code units are consumed, but 0xD0 and 0x00 are not
   1454   * consumed because 0xD0 encodes a two-byte lead unit but 0x00 is not a
   1455   * valid trailing code unit.
   1456   *
   1457   * It is expected that the caller will report such an encoding error when
   1458   * it attempts to consume the next code point.
   1459   */
   1460  void consumeRestOfSingleLineComment();
   1461 
   1462  /**
   1463   * The maximum radius of code around the location of an error that should
   1464   * be included in a syntax error message -- this many code units to either
   1465   * side.  The resulting window of data is then accordingly trimmed so that
   1466   * the window contains only validly-encoded data.
   1467   *
   1468   * Because this number is the same for both UTF-8 and UTF-16, windows in
   1469   * UTF-8 may contain fewer code points than windows in UTF-16.  As we only
   1470   * use this for error messages, we don't particularly care.
   1471   */
   1472  static constexpr size_t WindowRadius = ErrorMetadata::lineOfContextRadius;
   1473 
   1474  /**
   1475   * From absolute offset |offset|, search backward to find an absolute
   1476   * offset within source text, no further than |WindowRadius| code units
   1477   * away from |offset|, such that all code points from that offset to
   1478   * |offset| are valid, non-LineTerminator code points.
   1479   */
   1480  size_t findWindowStart(size_t offset) const;
   1481 
   1482  /**
   1483   * From absolute offset |offset|, find an absolute offset within source
   1484   * text, no further than |WindowRadius| code units away from |offset|, such
   1485   * that all code points from |offset| to that offset are valid,
   1486   * non-LineTerminator code points.
   1487   */
   1488  size_t findWindowEnd(size_t offset) const;
   1489 
   1490  /**
   1491   * Given a |window| of |encodingSpecificWindowLength| units encoding valid
   1492   * Unicode text, with index |encodingSpecificTokenOffset| indicating a
   1493   * particular code point boundary in |window|, compute the corresponding
   1494   * token offset and length if |window| were encoded in UTF-16.  For
   1495   * example:
   1496   *
   1497   *   // U+03C0 GREEK SMALL LETTER PI is encoded as 0xCF 0x80.
   1498   *   const Utf8Unit* encodedWindow =
   1499   *     reinterpret_cast<const Utf8Unit*>(u8"ππππ = @ FAIL");
   1500   *   size_t encodedTokenOffset = 11; // 2 * 4 + ' = '.length
   1501   *   size_t encodedWindowLength = 17; // 2 * 4 + ' = @ FAIL'.length
   1502   *   size_t utf16Offset, utf16Length;
   1503   *   computeWindowOffsetAndLength(encodedWindow,
   1504   *                                encodedTokenOffset, &utf16Offset,
   1505   *                                encodedWindowLength, &utf16Length);
   1506   *   MOZ_ASSERT(utf16Offset == 7);
   1507   *   MOZ_ASSERT(utf16Length == 13);
   1508   *
   1509   * This function asserts if called for UTF-16: the sole caller can avoid
   1510   * computing UTF-16 offsets when they're definitely the same as the encoded
   1511   * offsets.
   1512   */
   1513  inline void computeWindowOffsetAndLength(const Unit* encodedWindow,
   1514                                           size_t encodingSpecificTokenOffset,
   1515                                           size_t* utf16TokenOffset,
   1516                                           size_t encodingSpecificWindowLength,
   1517                                           size_t* utf16WindowLength) const;
   1518 };
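
The offset and length conversion documented for |computeWindowOffsetAndLength| can be illustrated outside the engine. The following is a minimal sketch, not the SpiderMonkey implementation: `Utf16Window`, `Utf16LengthOfValidUtf8`, and `ComputeUtf16Window` are hypothetical names, and the input is assumed to be valid UTF-8 with the token offset on a code point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: the token offset and window length in UTF-16 units.
struct Utf16Window {
  size_t tokenOffset;
  size_t length;
};

// Count the UTF-16 code units needed for |utf8|, assumed valid UTF-8.
inline size_t Utf16LengthOfValidUtf8(std::string_view utf8) {
  size_t utf16Units = 0;
  for (char c : utf8) {
    unsigned char unit = static_cast<unsigned char>(c);
    if ((unit & 0xC0) == 0x80) {
      continue;  // Trailing unit: already counted with its lead unit.
    }
    // A four-byte sequence (lead unit >= 0xF0) encodes a code point above
    // U+FFFF, which takes a surrogate pair in UTF-16.
    utf16Units += (unit >= 0xF0) ? 2 : 1;
  }
  return utf16Units;
}

// Convert a UTF-8 window and token offset to their UTF-16 equivalents.
inline Utf16Window ComputeUtf16Window(std::string_view window,
                                      size_t tokenOffset) {
  assert(tokenOffset <= window.size());
  size_t before = Utf16LengthOfValidUtf8(window.substr(0, tokenOffset));
  size_t after = Utf16LengthOfValidUtf8(window.substr(tokenOffset));
  return {before, before + after};
}
```

With the window from the comment above (four U+03C0 followed by " = @ FAIL"), a UTF-8 token offset of 11 maps to a UTF-16 offset of 7, and the 17-unit window maps to 13 UTF-16 units.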
   1519 
   1520 template <>
   1521 inline void SourceUnits<char16_t>::ungetLineOrParagraphSeparator() {
   1522 #ifdef DEBUG
   1523  char16_t prev = previousCodeUnit();
   1524 #endif
   1525  MOZ_ASSERT(prev == unicode::LINE_SEPARATOR ||
   1526             prev == unicode::PARA_SEPARATOR);
   1527 
   1528  ungetCodeUnit();
   1529 }
   1530 
   1531 template <>
   1532 inline void SourceUnits<mozilla::Utf8Unit>::ungetLineOrParagraphSeparator() {
   1533  unskipCodeUnits(3);
   1534 
   1535  MOZ_ASSERT(ptr[0].toUint8() == 0xE2);
   1536  MOZ_ASSERT(ptr[1].toUint8() == 0x80);
   1537 
   1538 #ifdef DEBUG
   1539  uint8_t last = ptr[2].toUint8();
   1540 #endif
   1541  MOZ_ASSERT(last == 0xA8 || last == 0xA9);
   1542 }
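
The UTF-8 specialization above steps back three code units and checks for 0xE2 0x80 followed by 0xA8 or 0xA9, while the UTF-16 one steps back a single unit. A small sketch of why, using a hypothetical `EncodeThreeByteUtf8` helper that is not part of this header:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode a non-surrogate code point in [U+0800, U+FFFF]
// as its three UTF-8 code units.
inline std::vector<uint8_t> EncodeThreeByteUtf8(char32_t codePoint) {
  assert(codePoint >= 0x0800 && codePoint <= 0xFFFF);
  assert(codePoint < 0xD800 || codePoint > 0xDFFF);
  return {
      static_cast<uint8_t>(0xE0 | (codePoint >> 12)),
      static_cast<uint8_t>(0x80 | ((codePoint >> 6) & 0x3F)),
      static_cast<uint8_t>(0x80 | (codePoint & 0x3F)),
  };
}
```

U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR encode as E2 80 A8 and E2 80 A9 respectively, which matches the three assertions in the UTF-8 specialization.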
   1543 
   1544 /**
   1545 * An all-purpose buffer type for accumulating text during tokenizing.
   1546 *
   1547 * In principle we could make this buffer contain |char16_t|, |Utf8Unit|, or
   1548 * |Unit|.  We use |char16_t| because:
   1549 *
   1550 *   * we don't have a UTF-8 regular expression parser, so in general regular
   1551 *     expression text must be copied to a separate UTF-16 buffer to parse it,
   1552 *     and
   1553 *   * |TokenStreamCharsShared::copyCharBufferTo|, which copies a shared
   1554 *     |CharBuffer| to a |char16_t*|, is simpler if it doesn't have to convert.
   1555 */
   1556 using CharBuffer = Vector<char16_t, 32>;
   1557 
   1558 /**
   1559 * Append the provided code point (in the range [U+0000, U+10FFFF], surrogate
   1560 * code points included) to the buffer.
   1561 */
   1562 [[nodiscard]] extern bool AppendCodePointToCharBuffer(CharBuffer& charBuffer,
   1563                                                      char32_t codePoint);
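
For illustration only, appending a code point to a UTF-16 buffer might look like the sketch below. It uses `std::u16string` in place of |CharBuffer| and a hypothetical `EncodeUtf16` helper; the real |AppendCodePointToCharBuffer| is defined elsewhere and may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in: return the UTF-16 code units for |codePoint|.
// Surrogate code points are passed through as a single unit, mirroring the
// "surrogate code points included" contract described above.
inline std::u16string EncodeUtf16(char32_t codePoint) {
  assert(codePoint <= 0x10FFFF);
  std::u16string units;
  if (codePoint < 0x10000) {
    units.push_back(static_cast<char16_t>(codePoint));
    return units;
  }
  // Supplementary code points become a lead/trail surrogate pair.
  char32_t offset = codePoint - 0x10000;
  units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
  units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
  return units;
}
```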
   1564 
   1565 /**
   1566 * Accumulate the range of UTF-16 text (lone surrogates permitted, because JS
   1567 * allows them in source text) into |charBuffer|.  Normalize '\r', '\n', and
   1568 * "\r\n" into '\n'.
   1569 */
   1570 [[nodiscard]] extern bool FillCharBufferFromSourceNormalizingAsciiLineBreaks(
   1571    CharBuffer& charBuffer, const char16_t* cur, const char16_t* end);
   1572 
   1573 /**
   1574 * Accumulate the range of previously-validated UTF-8 text into |charBuffer|.
   1575 * Normalize '\r', '\n', and "\r\n" into '\n'.
   1576 */
   1577 [[nodiscard]] extern bool FillCharBufferFromSourceNormalizingAsciiLineBreaks(
   1578    CharBuffer& charBuffer, const mozilla::Utf8Unit* cur,
   1579    const mozilla::Utf8Unit* end);
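
The normalization these two functions describe can be sketched for the UTF-16 case as follows. This is an assumption-laden illustration, not the engine's code: `NormalizeAsciiLineBreaks` is a hypothetical name and `std::u16string` stands in for |CharBuffer|.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Copy |source|, rewriting '\r' and "\r\n" as '\n'.  Non-ASCII line
// terminators such as U+2028 are copied unchanged.
inline std::u16string NormalizeAsciiLineBreaks(std::u16string_view source) {
  std::u16string out;
  out.reserve(source.size());
  for (size_t i = 0; i < source.size(); i++) {
    char16_t unit = source[i];
    if (unit == u'\r') {
      // Swallow the '\n' of a "\r\n" pair so it yields one '\n'.
      if (i + 1 < source.size() && source[i + 1] == u'\n') {
        i++;
      }
      unit = u'\n';
    }
    out.push_back(unit);
  }
  assert(out.size() <= source.size());
  return out;
}
```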
   1580 
   1581 class TokenStreamCharsShared {
   1582 protected:
   1583  FrontendContext* fc;
   1584 
   1585  /**
   1586   * Buffer transiently used to store sequences of identifier or string code
   1587   * points when such can't be directly processed from the original source
   1588   * text (e.g. because it contains escapes).
   1589   */
   1590  CharBuffer charBuffer;
   1591 
   1592  /** Information for parsing with a lifetime longer than the parser itself. */
   1593  ParserAtomsTable* parserAtoms;
   1594 
   1595 protected:
   1596  explicit TokenStreamCharsShared(FrontendContext* fc,
   1597                                  ParserAtomsTable* parserAtoms)
   1598      : fc(fc), charBuffer(fc), parserAtoms(parserAtoms) {}
   1599 
   1600  [[nodiscard]] bool copyCharBufferTo(
   1601      UniquePtr<char16_t[], JS::FreePolicy>* destination);
   1602 
   1603  /**
   1604   * Determine whether a code unit constitutes a complete ASCII code point.
   1605   * (The code point's exact value might not be used, however, if subsequent
   1606   * code observes that |unit| is part of a LineTerminatorSequence.)
   1607   */
   1608  [[nodiscard]] static constexpr MOZ_ALWAYS_INLINE bool isAsciiCodePoint(
   1609      int32_t unit) {
   1610    return mozilla::IsAscii(static_cast<char32_t>(unit));
   1611  }
   1612 
   1613  TaggedParserAtomIndex drainCharBufferIntoAtom() {
   1614    // Add to parser atoms table.
   1615    auto atom = this->parserAtoms->internChar16(fc, charBuffer.begin(),
   1616                                                charBuffer.length());
   1617    charBuffer.clear();
   1618    return atom;
   1619  }
   1620 
   1621 protected:
   1622  void adoptState(TokenStreamCharsShared& other) {
   1623    // The other stream's buffer may contain information for a
   1624    // gotten-then-ungotten token, that we must transfer into this stream so
   1625    // that token's final get behaves as desired.
   1626    charBuffer = std::move(other.charBuffer);
   1627  }
   1628 
   1629 public:
   1630  CharBuffer& getCharBuffer() { return charBuffer; }
   1631 };
   1632 
   1633 template <typename Unit>
   1634 class TokenStreamCharsBase : public TokenStreamCharsShared {
   1635 protected:
   1636  using SourceUnits = frontend::SourceUnits<Unit>;
   1637 
   1638  /** Code units in the source code being tokenized. */
   1639  SourceUnits sourceUnits;
   1640 
   1641  // End of fields.
   1642 
   1643 protected:
   1644  TokenStreamCharsBase(FrontendContext* fc, ParserAtomsTable* parserAtoms,
   1645                       const Unit* units, size_t length, size_t startOffset);
   1646 
   1647  /**
   1648   * Convert a non-EOF code unit returned by |getCodeUnit()| or
   1649   * |peekCodeUnit()| to a Unit code unit.
   1650   */
   1651  inline Unit toUnit(int32_t codeUnitValue);
   1652 
   1653  void ungetCodeUnit(int32_t c) {
   1654    if (c == EOF) {
   1655      MOZ_ASSERT(sourceUnits.atEnd());
   1656      return;
   1657    }
   1658 
   1659    MOZ_ASSERT(sourceUnits.previousCodeUnit() == toUnit(c));
   1660    sourceUnits.ungetCodeUnit();
   1661  }
   1662 
   1663  MOZ_ALWAYS_INLINE TaggedParserAtomIndex
   1664  atomizeSourceChars(mozilla::Span<const Unit> units);
   1665 
   1666  /**
   1667   * Try to match a non-LineTerminator ASCII code point.  Return true iff it
   1668   * was matched.
   1669   */
   1670  bool matchCodeUnit(char expect) {
   1671    MOZ_ASSERT(mozilla::IsAscii(expect));
   1672    MOZ_ASSERT(expect != '\r');
   1673    MOZ_ASSERT(expect != '\n');
   1674    return this->sourceUnits.internalMatchCodeUnit(Unit(expect));
   1675  }
   1676 
   1677  /**
   1678   * Try to match an ASCII LineTerminator code point.  Return true iff it was
   1679   * matched.
   1680   */
   1681  MOZ_NEVER_INLINE bool matchLineTerminator(char expect) {
   1682    MOZ_ASSERT(expect == '\r' || expect == '\n');
   1683    return this->sourceUnits.internalMatchCodeUnit(Unit(expect));
   1684  }
   1685 
   1686  template <typename T>
   1687  bool matchCodeUnit(T) = delete;
   1688  template <typename T>
   1689  bool matchLineTerminator(T) = delete;
   1690 
   1691  int32_t peekCodeUnit() {
   1692    return MOZ_LIKELY(!sourceUnits.atEnd())
   1693               ? CodeUnitValue(sourceUnits.peekCodeUnit())
   1694               : EOF;
   1695  }
   1696 
   1697  /** Consume a known, non-EOF code unit. */
   1698  inline void consumeKnownCodeUnit(int32_t unit);
   1699 
   1700  // Forbid accidental calls to consumeKnownCodeUnit *not* with the single
   1701  // unit-or-EOF type.  Unit should use SourceUnits::consumeKnownCodeUnit;
   1702  // CodeUnitValue() results should go through toUnit(), or better yet just
   1703  // use the original Unit.
   1704  template <typename T>
   1705  inline void consumeKnownCodeUnit(T) = delete;
   1706 
   1707  /**
   1708   * Add a null-terminated line of context to error information, for the line
   1709   * in |sourceUnits| that contains |offset|.  Also record the window's
   1710   * length and the offset of the error in the window.  (Don't bother adding
   1711   * a line of context if it would be empty.)
   1712   *
   1713   * The window will contain no LineTerminators of any kind, and it will not
   1714   * extend more than |SourceUnits::WindowRadius| to either side of |offset|,
   1715   * nor into the previous or next lines.
   1716   *
   1717   * This function is quite internal, and you probably should be calling one
   1718   * of its existing callers instead.
   1719   */
   1720  [[nodiscard]] bool addLineOfContext(ErrorMetadata* err,
   1721                                      uint32_t offset) const;
   1722 };
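
The window constraints documented for |addLineOfContext| (at most |WindowRadius| units to either side, never crossing a line break) can be sketched as below. This is a simplified illustration under stated assumptions: `ContextWindow` and `FindContextWindow` are hypothetical, the source is treated as ASCII, and only '\r' and '\n' count as line breaks, whereas the real code also handles U+2028, U+2029, and encoding validity.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: the half-open range [start, end) of the window.
struct ContextWindow {
  size_t start;
  size_t end;
};

inline bool IsAsciiLineBreak(char c) { return c == '\n' || c == '\r'; }

// Expand outward from |offset| by at most |radius| units in each direction,
// stopping early at a line break or at the bounds of |source|.
inline ContextWindow FindContextWindow(std::string_view source, size_t offset,
                                       size_t radius) {
  assert(offset <= source.size());
  size_t start = offset;
  while (start > 0 && offset - start < radius &&
         !IsAsciiLineBreak(source[start - 1])) {
    start--;
  }
  size_t end = offset;
  while (end < source.size() && end - offset < radius &&
         !IsAsciiLineBreak(source[end])) {
    end++;
  }
  return {start, end};
}
```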
   1723 
   1724 template <>
   1725 inline char16_t TokenStreamCharsBase<char16_t>::toUnit(int32_t codeUnitValue) {
   1726  MOZ_ASSERT(codeUnitValue != EOF, "EOF is not a Unit");
   1727  return mozilla::AssertedCast<char16_t>(codeUnitValue);
   1728 }
   1729 
   1730 template <>
   1731 inline mozilla::Utf8Unit TokenStreamCharsBase<mozilla::Utf8Unit>::toUnit(
   1732    int32_t value) {
   1733  MOZ_ASSERT(value != EOF, "EOF is not a Unit");
   1734  return mozilla::Utf8Unit(mozilla::AssertedCast<unsigned char>(value));
   1735 }
   1736 
   1737 template <typename Unit>
   1738 inline void TokenStreamCharsBase<Unit>::consumeKnownCodeUnit(int32_t unit) {
   1739  sourceUnits.consumeKnownCodeUnit(toUnit(unit));
   1740 }
   1741 
   1742 template <>
   1743 MOZ_ALWAYS_INLINE TaggedParserAtomIndex
   1744 TokenStreamCharsBase<char16_t>::atomizeSourceChars(
   1745    mozilla::Span<const char16_t> units) {
   1746  return this->parserAtoms->internChar16(fc, units.data(), units.size());
   1747 }
   1748 
   1749 template <>
   1750 /* static */ MOZ_ALWAYS_INLINE TaggedParserAtomIndex
   1751 TokenStreamCharsBase<mozilla::Utf8Unit>::atomizeSourceChars(
   1752    mozilla::Span<const mozilla::Utf8Unit> units) {
   1753  return this->parserAtoms->internUtf8(fc, units.data(), units.size());
   1754 }
   1755 
   1756 template <typename Unit>
   1757 class SpecializedTokenStreamCharsBase;
   1758 
   1759 template <>
   1760 class SpecializedTokenStreamCharsBase<char16_t>
   1761    : public TokenStreamCharsBase<char16_t> {
   1762  using CharsBase = TokenStreamCharsBase<char16_t>;
   1763 
   1764 protected:
   1765  using TokenStreamCharsShared::isAsciiCodePoint;
   1766  // Deliberately don't |using| |sourceUnits| because of bug 1472569.  :-(
   1767 
   1768  using typename CharsBase::SourceUnits;
   1769 
   1770 protected:
   1771  // These APIs are only usable by UTF-16-specific code.
   1772 
   1773  /**
   1774   * Given |lead| already consumed, consume and return the code point encoded
   1775   * starting from it.  Infallible because lone surrogates in JS encode a
   1776   * "code point" of the same value.
   1777   */
   1778  char32_t infallibleGetNonAsciiCodePointDontNormalize(char16_t lead) {
   1779    MOZ_ASSERT(!isAsciiCodePoint(lead));
   1780    MOZ_ASSERT(this->sourceUnits.previousCodeUnit() == lead);
   1781 
   1782    // Handle single-unit code points and lone trailing surrogates.
   1783    if (MOZ_LIKELY(!unicode::IsLeadSurrogate(lead)) ||
   1784        // Or handle lead surrogates not paired with trailing surrogates.
   1785        MOZ_UNLIKELY(
   1786            this->sourceUnits.atEnd() ||
   1787            !unicode::IsTrailSurrogate(this->sourceUnits.peekCodeUnit()))) {
   1788      return lead;
   1789    }
   1790 
   1791    // Otherwise it's a multi-unit code point.
   1792    return unicode::UTF16Decode(lead, this->sourceUnits.getCodeUnit());
   1793  }
   1794 
   1795 protected:
   1796  // These APIs are in both SpecializedTokenStreamCharsBase specializations
   1797  // and so are usable in subclasses no matter what Unit is.
   1798 
   1799  using CharsBase::CharsBase;
   1800 };
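
The "infallible" UTF-16 decoding described above, where a lone surrogate decodes to a code point of the same value, can be sketched independently of the token stream. `DecodedCodePoint` and `DecodeUtf16Lenient` are hypothetical names; unlike |infallibleGetNonAsciiCodePointDontNormalize|, this sketch indexes into a buffer rather than consuming from |sourceUnits|.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: the decoded value and how many units it used.
struct DecodedCodePoint {
  char32_t codePoint;
  size_t unitsConsumed;
};

// Decode the code point starting at |units[index]|.  Never fails: unpaired
// surrogates are returned as-is, consuming one unit.
inline DecodedCodePoint DecodeUtf16Lenient(std::u16string_view units,
                                           size_t index) {
  assert(index < units.size());
  char16_t lead = units[index];
  bool isLeadSurrogate = lead >= 0xD800 && lead <= 0xDBFF;
  if (!isLeadSurrogate || index + 1 == units.size()) {
    return {lead, 1};
  }
  char16_t trail = units[index + 1];
  if (trail < 0xDC00 || trail > 0xDFFF) {
    return {lead, 1};  // Lead surrogate without a trailing surrogate.
  }
  char32_t codePoint = 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
                       (char32_t(trail) - 0xDC00);
  return {codePoint, 2};
}
```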
   1801 
   1802 template <>
   1803 class SpecializedTokenStreamCharsBase<mozilla::Utf8Unit>
   1804    : public TokenStreamCharsBase<mozilla::Utf8Unit> {
   1805  using CharsBase = TokenStreamCharsBase<mozilla::Utf8Unit>;
   1806 
   1807 protected:
   1808  // Deliberately don't |using| |sourceUnits| because of bug 1472569.  :-(
   1809 
   1810 protected:
   1811  // These APIs are only usable by UTF-8-specific code.
   1812 
   1813  using typename CharsBase::SourceUnits;
   1814 
   1815  /**
   1816   * A mutable iterator-wrapper around |SourceUnits| that translates
   1817   * operators to calls to |SourceUnits::getCodeUnit()| and similar.
   1818   *
   1819   * This class is expected to be used in concert with |SourceUnitsEnd|.
   1820   */
   1821  class SourceUnitsIterator {
   1822    SourceUnits& sourceUnits_;
   1823 #ifdef DEBUG
   1824    // In iterator copies created by the post-increment operator, a pointer
   1825    // at the next source text code unit when the post-increment operator
   1826    // was called, cleared when the iterator is dereferenced.
   1827    mutable mozilla::Maybe<const mozilla::Utf8Unit*>
   1828        currentBeforePostIncrement_;
   1829 #endif
   1830 
   1831   public:
   1832    explicit SourceUnitsIterator(SourceUnits& sourceUnits)
   1833        : sourceUnits_(sourceUnits) {}
   1834 
   1835    mozilla::Utf8Unit operator*() const {
   1836      // operator* is expected to get the *next* value from an iterator
   1837      // not pointing at the end of the underlying range.  However, the
   1838      // sole use of this is in the context of an expression of the form
   1839      // |*iter++|, that performed the |sourceUnits_.getCodeUnit()| in
   1840      // the |operator++(int)| below -- so dereferencing acts on a
   1841      // |sourceUnits_| already advanced.  Therefore the correct unit to
   1842      // return is the previous one.
   1843      MOZ_ASSERT(currentBeforePostIncrement_.value() + 1 ==
   1844                 sourceUnits_.current());
   1845 #ifdef DEBUG
   1846      currentBeforePostIncrement_.reset();
   1847 #endif
   1848      return sourceUnits_.previousCodeUnit();
   1849    }
   1850 
   1851    SourceUnitsIterator operator++(int) {
   1852      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
   1853                 "the only valid operation on a post-incremented "
   1854                 "iterator is dereferencing a single time");
   1855 
   1856      SourceUnitsIterator copy = *this;
   1857 #ifdef DEBUG
   1858      copy.currentBeforePostIncrement_.emplace(sourceUnits_.current());
   1859 #endif
   1860 
   1861      sourceUnits_.getCodeUnit();
   1862      return copy;
   1863    }
   1864 
   1865    void operator-=(size_t n) {
   1866      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
   1867                 "the only valid operation on a post-incremented "
   1868                 "iterator is dereferencing a single time");
   1869      sourceUnits_.unskipCodeUnits(n);
   1870    }
   1871 
   1872    mozilla::Utf8Unit operator[](ptrdiff_t index) {
   1873      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
   1874                 "the only valid operation on a post-incremented "
   1875                 "iterator is dereferencing a single time");
   1876      MOZ_ASSERT(index == -1,
   1877                 "must only be called to verify the value of the "
   1878                 "previous code unit");
   1879      return sourceUnits_.previousCodeUnit();
   1880    }
   1881 
   1882    size_t remaining() const {
   1883      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
   1884                 "the only valid operation on a post-incremented "
   1885                 "iterator is dereferencing a single time");
   1886      return sourceUnits_.remaining();
   1887    }
   1888  };
   1889 
   1890  /** A sentinel representing the end of |SourceUnits| data. */
   1891  class SourceUnitsEnd {};
   1892 
   1893  friend inline size_t operator-(const SourceUnitsEnd& aEnd,
   1894                                 const SourceUnitsIterator& aIter);
   1895 
   1896 protected:
   1897  // These APIs are in both SpecializedTokenStreamCharsBase specializations
   1898  // and so are usable in subclasses no matter what Unit is.
   1899 
   1900  using CharsBase::CharsBase;
   1901 };
   1902 
   1903 inline size_t operator-(const SpecializedTokenStreamCharsBase<
   1904                            mozilla::Utf8Unit>::SourceUnitsEnd& aEnd,
   1905                        const SpecializedTokenStreamCharsBase<
   1906                            mozilla::Utf8Unit>::SourceUnitsIterator& aIter) {
   1907  return aIter.remaining();
   1908 }
   1909 
   1910 /** A small class encapsulating computation of the start-offset of a Token. */
   1911 class TokenStart {
   1912  uint32_t startOffset_;
   1913 
   1914 public:
   1915  /**
   1916   * Compute a starting offset that is the current offset of |sourceUnits|,
   1917   * offset by |adjust|.  (For example, |adjust| of -1 indicates the code
   1918   * unit one backwards from |sourceUnits|'s current offset.)
   1919   */
   1920  template <class SourceUnits>
   1921  TokenStart(const SourceUnits& sourceUnits, ptrdiff_t adjust)
   1922      : startOffset_(sourceUnits.offset() + adjust) {}
   1923 
   1924  TokenStart(const TokenStart&) = default;
   1925 
   1926  uint32_t offset() const { return startOffset_; }
   1927 };
   1928 
   1929 template <typename Unit, class AnyCharsAccess>
   1930 class GeneralTokenStreamChars : public SpecializedTokenStreamCharsBase<Unit> {
   1931  using CharsBase = TokenStreamCharsBase<Unit>;
   1932  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<Unit>;
   1933 
   1934  using LineToken = TokenStreamAnyChars::LineToken;
   1935 
   1936 private:
   1937  Token* newTokenInternal(TokenKind kind, TokenStart start, TokenKind* out);
   1938 
   1939  /**
   1940   * Allocates a new Token from the given offset to the current offset,
   1941   * ascribes it the given kind, and sets |*out| to that kind.
   1942   */
   1943  Token* newToken(TokenKind kind, TokenStart start,
   1944                  TokenStreamShared::Modifier modifier, TokenKind* out) {
   1945    Token* token = newTokenInternal(kind, start, out);
   1946 
   1947 #ifdef DEBUG
   1948    // Save the modifier used to get this token, so that if an ungetToken()
   1949    // occurs and then the token is re-gotten (or peeked, etc.), we can
   1950    // assert both gets used compatible modifiers.
   1951    token->modifier = modifier;
   1952 #endif
   1953 
   1954    return token;
   1955  }
   1956 
   1957  uint32_t matchUnicodeEscape(char32_t* codePoint);
   1958  uint32_t matchExtendedUnicodeEscape(char32_t* codePoint);
   1959 
   1960 protected:
   1961  using CharsBase::addLineOfContext;
   1962  using CharsBase::matchCodeUnit;
   1963  using CharsBase::matchLineTerminator;
   1964  using TokenStreamCharsShared::drainCharBufferIntoAtom;
   1965  using TokenStreamCharsShared::isAsciiCodePoint;
   1966  // Deliberately don't |using CharsBase::sourceUnits| because of bug 1472569.
   1967  // :-(
   1968  using CharsBase::toUnit;
   1969 
   1970  using typename CharsBase::SourceUnits;
   1971 
   1972 protected:
   1973  using SpecializedCharsBase::SpecializedCharsBase;
   1974 
   1975  TokenStreamAnyChars& anyCharsAccess() {
   1976    return AnyCharsAccess::anyChars(this);
   1977  }
   1978 
   1979  const TokenStreamAnyChars& anyCharsAccess() const {
   1980    return AnyCharsAccess::anyChars(this);
   1981  }
   1982 
   1983  using TokenStreamSpecific =
   1984      frontend::TokenStreamSpecific<Unit, AnyCharsAccess>;
   1985 
   1986  TokenStreamSpecific* asSpecific() {
   1987    static_assert(
   1988        std::is_base_of_v<GeneralTokenStreamChars, TokenStreamSpecific>,
   1989        "static_cast below presumes an inheritance relationship");
   1990 
   1991    return static_cast<TokenStreamSpecific*>(this);
   1992  }
   1993 
   1994 protected:
   1995  /**
   1996   * Compute the column number in Unicode code points of the absolute |offset|
   1997   * within source text on the line corresponding to |lineToken|.
   1998   *
   1999   * |offset| must be a code point boundary, preceded only by validly-encoded
   2000   * source units.  (It doesn't have to be *followed* by valid source units.)
   2001   */
   2002  JS::LimitedColumnNumberOneOrigin computeColumn(LineToken lineToken,
   2003                                                 uint32_t offset) const;
   2004  void computeLineAndColumn(uint32_t offset, uint32_t* line,
   2005                            JS::LimitedColumnNumberOneOrigin* column) const;
   2006 
   2007  /**
   2008   * Fill in |err| completely, except for line-of-context information.
   2009   *
   2010   * Return true if the caller can compute a line of context from the token
   2011   * stream.  Otherwise return false.
   2012   */
   2013  [[nodiscard]] bool fillExceptingContext(ErrorMetadata* err,
   2014                                          uint32_t offset) const {
   2015    if (anyCharsAccess().fillExceptingContext(err, offset)) {
   2016      JS::LimitedColumnNumberOneOrigin columnNumber;
   2017      computeLineAndColumn(offset, &err->lineNumber, &columnNumber);
   2018      err->columnNumber = JS::ColumnNumberOneOrigin(columnNumber);
   2019      return true;
   2020    }
   2021    return false;
   2022  }
   2023 
   2024  void newSimpleToken(TokenKind kind, TokenStart start,
   2025                      TokenStreamShared::Modifier modifier, TokenKind* out) {
   2026    newToken(kind, start, modifier, out);
   2027  }
   2028 
   2029  void newNumberToken(double dval, DecimalPoint decimalPoint, TokenStart start,
   2030                      TokenStreamShared::Modifier modifier, TokenKind* out) {
   2031    Token* token = newToken(TokenKind::Number, start, modifier, out);
   2032    token->setNumber(dval, decimalPoint);
   2033  }
   2034 
   2035  void newBigIntToken(TokenStart start, TokenStreamShared::Modifier modifier,
   2036                      TokenKind* out) {
   2037    newToken(TokenKind::BigInt, start, modifier, out);
   2038  }
   2039 
   2040  void newAtomToken(TokenKind kind, TaggedParserAtomIndex atom,
   2041                    TokenStart start, TokenStreamShared::Modifier modifier,
   2042                    TokenKind* out) {
   2043    MOZ_ASSERT(kind == TokenKind::String || kind == TokenKind::TemplateHead ||
   2044               kind == TokenKind::NoSubsTemplate);
   2045 
   2046    Token* token = newToken(kind, start, modifier, out);
   2047    token->setAtom(atom);
   2048  }
   2049 
   2050  void newNameToken(TaggedParserAtomIndex name, TokenStart start,
   2051                    TokenStreamShared::Modifier modifier, TokenKind* out) {
   2052    Token* token = newToken(TokenKind::Name, start, modifier, out);
   2053    token->setName(name);
   2054  }
   2055 
   2056  void newPrivateNameToken(TaggedParserAtomIndex name, TokenStart start,
   2057                           TokenStreamShared::Modifier modifier,
   2058                           TokenKind* out) {
   2059    Token* token = newToken(TokenKind::PrivateName, start, modifier, out);
   2060    token->setName(name);
   2061  }
   2062 
   2063  void newRegExpToken(JS::RegExpFlags reflags, TokenStart start,
   2064                      TokenKind* out) {
   2065    Token* token = newToken(TokenKind::RegExp, start,
   2066                            TokenStreamShared::SlashIsRegExp, out);
   2067    token->setRegExpFlags(reflags);
   2068  }
   2069 
   2070  MOZ_COLD bool badToken();
   2071 
   2072  /**
   2073   * Get the next code unit -- the next numeric sub-unit of source text,
   2074   * possibly smaller than a full code point -- without updating line/column
   2075   * counters or consuming LineTerminatorSequences.
   2076   *
   2077   * Because of these limitations, only use this if (a) the resulting code
   2078   * unit is guaranteed to be ungotten (by ungetCodeUnit()) if it's an EOL,
   2079   * and (b) the line-related state (lineno, linebase) is not used before
   2080   * it's ungotten.
   2081   */
   2082  int32_t getCodeUnit() {
   2083    if (MOZ_LIKELY(!this->sourceUnits.atEnd())) {
   2084      return CodeUnitValue(this->sourceUnits.getCodeUnit());
   2085    }
   2086 
   2087    anyCharsAccess().flags.isEOF = true;
   2088    return EOF;
   2089  }
   2090 
   2091  void ungetCodeUnit(int32_t c) {
   2092    MOZ_ASSERT_IF(c == EOF, anyCharsAccess().flags.isEOF);
   2093 
   2094    CharsBase::ungetCodeUnit(c);
   2095  }
   2096 
   2097  /**
   2098   * Given a just-consumed ASCII code unit/point |lead|, consume a full code
   2099   * point or LineTerminatorSequence (normalizing it to '\n'). Return true on
   2100   * success, otherwise return false.
   2101   *
   2102   * If a LineTerminatorSequence was consumed, also update line/column info.
   2103   *
   2104   * This may change the current |sourceUnits| offset.
   2105   */
   2106  [[nodiscard]] MOZ_ALWAYS_INLINE bool getFullAsciiCodePoint(int32_t lead) {
   2107    MOZ_ASSERT(isAsciiCodePoint(lead),
   2108               "non-ASCII code units must be handled separately");
   2109    MOZ_ASSERT(toUnit(lead) == this->sourceUnits.previousCodeUnit(),
   2110               "getFullAsciiCodePoint called incorrectly");
   2111 
   2112    if (MOZ_UNLIKELY(lead == '\r')) {
   2113      matchLineTerminator('\n');
   2114    } else if (MOZ_LIKELY(lead != '\n')) {
   2115      return true;
   2116    }
   2117    return updateLineInfoForEOL();
   2118  }
   2119 
   2120  [[nodiscard]] MOZ_NEVER_INLINE bool updateLineInfoForEOL() {
   2121    return anyCharsAccess().internalUpdateLineInfoForEOL(
   2122        this->sourceUnits.offset());
   2123  }
   2124 
   2125  uint32_t matchUnicodeEscapeIdStart(char32_t* codePoint);
   2126  bool matchUnicodeEscapeIdent(char32_t* codePoint);
   2127  bool matchIdentifierStart();
   2128 
   2129  /**
   2130   * If possible, compute a line of context for an otherwise-filled-in |err|
   2131   * at the given offset in this token stream.
   2132   *
   2133   * This function is very-internal: almost certainly you should use one of
   2134   * its callers instead.  It basically exists only to make those callers
   2135   * more readable.
   2136   */
   2137  [[nodiscard]] bool internalComputeLineOfContext(ErrorMetadata* err,
   2138                                                  uint32_t offset) const {
   2139    // We only have line-start information for the current line.  If the error
   2140    // is on a different line, we can't easily provide context.  (This means
   2141    // any error in a multi-line token, e.g. an unterminated multiline string
   2142    // literal, won't have context.)
   2143    if (err->lineNumber != anyCharsAccess().lineno) {
   2144      return true;
   2145    }
   2146 
   2147    return addLineOfContext(err, offset);
   2148  }
   2149 
   2150 public:
   2151  /**
   2152   * Consume any hashbang comment at the start of a Script or Module, if one is
   2153   * present.  Stops consuming just before any terminating LineTerminator or
   2154   * before an encoding error is encountered.
   2155   */
   2156  void consumeOptionalHashbangComment();
   2157 
   2158  TaggedParserAtomIndex getRawTemplateStringAtom() {
   2159    TokenStreamAnyChars& anyChars = anyCharsAccess();
   2160 
   2161    MOZ_ASSERT(anyChars.currentToken().type == TokenKind::TemplateHead ||
   2162               anyChars.currentToken().type == TokenKind::NoSubsTemplate);
   2163    const Unit* cur =
   2164        this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.begin + 1);
   2165    const Unit* end;
   2166    if (anyChars.currentToken().type == TokenKind::TemplateHead) {
   2167      // Of the form    |`...${|   or   |}...${|
   2168      end =
   2169          this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.end - 2);
   2170    } else {
   2171      // NoSubsTemplate is of the form   |`...`|   or   |}...`|
   2172      end =
   2173          this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.end - 1);
   2174    }
   2175 
   2176    // |charBuffer| should be empty here, but we may as well code defensively.
   2177    MOZ_ASSERT(this->charBuffer.length() == 0);
   2178    this->charBuffer.clear();
   2179 
   2180    // Template literals normalize only '\r' and "\r\n" to '\n'; Unicode
   2181    // separators don't need special handling.
   2182    // https://tc39.github.io/ecma262/#sec-static-semantics-tv-and-trv
   2183    if (!FillCharBufferFromSourceNormalizingAsciiLineBreaks(this->charBuffer,
   2184                                                            cur, end)) {
   2185      return TaggedParserAtomIndex::null();
   2186    }
   2187 
   2188    return drainCharBufferIntoAtom();
   2189  }
   2190 };
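
The pointer arithmetic in |getRawTemplateStringAtom| skips one opening unit and then one or two closing units, depending on the token kind. A minimal sketch of that slicing, using a hypothetical `RawTemplateText` helper over `std::string_view` instead of token positions (the real function additionally normalizes line breaks and atomizes the result):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical mirror of the two token kinds handled above.
enum class TemplateTokenKind { TemplateHead, NoSubsTemplate };

// Return the raw text between a template token's delimiters.  A TemplateHead
// ends in "${" (two units); a NoSubsTemplate ends in "`" (one unit).  Both
// begin with a single "`" or "}".
inline std::string_view RawTemplateText(std::string_view token,
                                        TemplateTokenKind kind) {
  size_t closerLength = kind == TemplateTokenKind::TemplateHead ? 2 : 1;
  assert(token.size() >= 1 + closerLength);
  return token.substr(1, token.size() - 1 - closerLength);
}
```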
   2191 
   2192 template <typename Unit, class AnyCharsAccess>
   2193 class TokenStreamChars;
   2194 
   2195 template <class AnyCharsAccess>
   2196 class TokenStreamChars<char16_t, AnyCharsAccess>
   2197    : public GeneralTokenStreamChars<char16_t, AnyCharsAccess> {
   2198  using CharsBase = TokenStreamCharsBase<char16_t>;
   2199  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<char16_t>;
   2200  using GeneralCharsBase = GeneralTokenStreamChars<char16_t, AnyCharsAccess>;
   2201  using Self = TokenStreamChars<char16_t, AnyCharsAccess>;
   2202 
   2203  using GeneralCharsBase::asSpecific;
   2204 
   2205  using typename GeneralCharsBase::TokenStreamSpecific;
   2206 
   2207 protected:
   2208  using CharsBase::matchLineTerminator;
   2209  using GeneralCharsBase::anyCharsAccess;
   2210  using GeneralCharsBase::getCodeUnit;
   2211  using SpecializedCharsBase::infallibleGetNonAsciiCodePointDontNormalize;
   2212  using TokenStreamCharsShared::isAsciiCodePoint;
   2213  // Deliberately don't |using| |sourceUnits| because of bug 1472569.  :-(
   2214  using GeneralCharsBase::ungetCodeUnit;
   2215  using GeneralCharsBase::updateLineInfoForEOL;
   2216 
   2217 protected:
   2218  using GeneralCharsBase::GeneralCharsBase;
   2219 
   2220  /**
   2221   * Given the non-ASCII |lead| code unit just consumed, consume and return a
   2222   * complete non-ASCII code point.  Line/column updates are not performed,
   2223   * and line breaks are returned as-is without normalization.
   2224   */
   2225  [[nodiscard]] bool getNonAsciiCodePointDontNormalize(char16_t lead,
   2226                                                       char32_t* codePoint) {
   2227    // There are no encoding errors in 16-bit JS, so implement this so that
   2228    // the compiler knows it, too.
   2229    *codePoint = infallibleGetNonAsciiCodePointDontNormalize(lead);
   2230    return true;
   2231  }
   2232 
   2233  /**
   2234   * Given a just-consumed non-ASCII code unit |lead| (which may also be a
   2235   * full code point, for UTF-16), consume a full code point or
   2236   * LineTerminatorSequence (normalizing it to '\n') and store it in
   2237   * |*codePoint|.  Return true on success, otherwise return false and leave
   2238   * |*codePoint| undefined on failure.
   2239   *
   2240   * If a LineTerminatorSequence was consumed, also update line/column info.
   2241   *
   2242   * This may change the current |sourceUnits| offset.
   2243   */
   2244  [[nodiscard]] bool getNonAsciiCodePoint(int32_t lead, char32_t* codePoint);
   2245 };
   2246 
   2247 template <class AnyCharsAccess>
   2248 class TokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>
   2249    : public GeneralTokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess> {
   2250  using CharsBase = TokenStreamCharsBase<mozilla::Utf8Unit>;
   2251  using SpecializedCharsBase =
   2252      SpecializedTokenStreamCharsBase<mozilla::Utf8Unit>;
   2253  using GeneralCharsBase =
   2254      GeneralTokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>;
   2255  using Self = TokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>;
   2256 
   2257  using typename SpecializedCharsBase::SourceUnitsEnd;
   2258  using typename SpecializedCharsBase::SourceUnitsIterator;
   2259 
   2260 protected:
   2261  using GeneralCharsBase::anyCharsAccess;
   2262  using GeneralCharsBase::computeLineAndColumn;
   2263  using GeneralCharsBase::fillExceptingContext;
   2264  using GeneralCharsBase::internalComputeLineOfContext;
   2265  using TokenStreamCharsShared::isAsciiCodePoint;
   2266  // Deliberately don't |using| |sourceUnits| because of bug 1472569.  :-(
   2267  using GeneralCharsBase::updateLineInfoForEOL;
   2268 
   2269 private:
   2270  static char toHexChar(uint8_t nibble) {
   2271    MOZ_ASSERT(nibble < 16);
   2272    return "0123456789ABCDEF"[nibble];
   2273  }
   2274 
   2275  static void byteToString(uint8_t n, char* str) {
   2276    str[0] = '0';
   2277    str[1] = 'x';
   2278    str[2] = toHexChar(n >> 4);
   2279    str[3] = toHexChar(n & 0xF);
   2280  }
   2281 
   2282  static void byteToTerminatedString(uint8_t n, char* str) {
   2283    byteToString(n, str);
   2284    str[4] = '\0';
   2285  }
   2286 
   2287  /**
   2288   * Report a UTF-8 encoding-related error for a code point starting AT THE
   2289   * CURRENT OFFSET.
   2290   *
   2291   * |relevantUnits| indicates how many code units from the current offset
   2292   * are potentially relevant to the reported error, such that they may be
   2293   * included in the error message.  For example, if at the current offset we
   2294   * have
   2295   *
   2296   *   0b1111'1111 ...
   2297   *
   2298   * a code unit never allowed in UTF-8, then |relevantUnits| might be 1
   2299   * because only that unit is relevant.  Or if we have
   2300   *
   2301   *   0b1111'0111 0b1011'0101 0b0000'0000 ...
   2302   *
   2303   * where the first two code units are a valid prefix to a four-unit code
   2304   * point but the third unit *isn't* a valid trailing code unit, then
   2305   * |relevantUnits| might be 3.
   2306   */
   2307  MOZ_COLD void internalEncodingError(uint8_t relevantUnits,
   2308                                      unsigned errorNumber, ...);
   2309 
   2310  // Don't use |internalEncodingError|!  Use one of the elaborated functions
   2311  // that calls it, below -- all of which should be used to indicate an error
   2312  // in a code point starting AT THE CURRENT OFFSET as with
   2313  // |internalEncodingError|.
   2314 
   2315  /** Report an error for an invalid lead code unit |lead|. */
   2316  MOZ_COLD void badLeadUnit(mozilla::Utf8Unit lead);
   2317 
   2318  /**
   2319   * Report an error when there aren't enough code units remaining to
   2320   * constitute a full code point after |lead|: only |remaining| code units
   2321   * were available for a code point starting with |lead|, when at least
   2322   * |required| code units were required.
   2323   */
   2324  MOZ_COLD void notEnoughUnits(mozilla::Utf8Unit lead, uint8_t remaining,
   2325                               uint8_t required);
   2326 
   2327  /**
   2328   * Report an error for a bad trailing UTF-8 code unit, where the bad
   2329   * trailing unit was the last of |unitsObserved| units examined from the
   2330   * current offset.
   2331   */
   2332  MOZ_COLD void badTrailingUnit(uint8_t unitsObserved);
   2333 
   2334  // Helper used for both |badCodePoint| and |notShortestForm| for code units
   2335  // that have all the requisite high bits set/unset in a manner that *could*
   2336  // encode a valid code point, but the remaining bits encoding its actual
   2337  // value do not define a permitted value.
   2338  MOZ_COLD void badStructurallyValidCodePoint(char32_t codePoint,
   2339                                              uint8_t codePointLength,
   2340                                              const char* reason);
   2341 
   2342  /**
   2343   * Report an error for UTF-8 that encodes a UTF-16 surrogate or a number
   2344   * outside the Unicode range.
   2345   */
   2346  MOZ_COLD void badCodePoint(char32_t codePoint, uint8_t codePointLength) {
   2347    MOZ_ASSERT(unicode::IsSurrogate(codePoint) ||
   2348               codePoint > unicode::NonBMPMax);
   2349 
   2350    badStructurallyValidCodePoint(codePoint, codePointLength,
   2351                                  unicode::IsSurrogate(codePoint)
   2352                                      ? "it's a UTF-16 surrogate"
   2353                                      : "the maximum code point is U+10FFFF");
   2354  }
   2355 
   2356  /**
   2357   * Report an error for UTF-8 that encodes a code point not in its shortest
   2358   * form.
   2359   */
   2360  MOZ_COLD void notShortestForm(char32_t codePoint, uint8_t codePointLength) {
   2361    MOZ_ASSERT(!unicode::IsSurrogate(codePoint));
   2362    MOZ_ASSERT(codePoint <= unicode::NonBMPMax);
   2363 
   2364    badStructurallyValidCodePoint(
   2365        codePoint, codePointLength,
   2366        "it wasn't encoded in shortest possible form");
   2367  }
   2368 
   2369 protected:
   2370  using GeneralCharsBase::GeneralCharsBase;
   2371 
   2372  /**
   2373   * Given the non-ASCII |lead| code unit just consumed, consume the rest of
   2374   * a non-ASCII code point.  The code point is not normalized: on success
   2375   * |*codePoint| may be U+2028 LINE SEPARATOR or U+2029 PARAGRAPH SEPARATOR.
   2376   *
   2377   * Report an error if an invalid code point is encountered.
   2378   */
   2379  [[nodiscard]] bool getNonAsciiCodePointDontNormalize(mozilla::Utf8Unit lead,
   2380                                                       char32_t* codePoint);
   2381 
   2382  /**
   2383   * Given a just-consumed non-ASCII code unit |lead|, consume a full code
   2384   * point or LineTerminatorSequence (normalizing it to '\n') and store it in
   2385   * |*codePoint|.  Return true on success, otherwise return false and leave
   2386   * |*codePoint| undefined on failure.
   2387   *
   2388   * If a LineTerminatorSequence was consumed, also update line/column info.
   2389   *
   2390   * This function will change the current |sourceUnits| offset.
   2391   */
   2392  [[nodiscard]] bool getNonAsciiCodePoint(int32_t lead, char32_t* codePoint);
   2393 };
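
// Illustrative sketch, not part of TokenStream.h: the five UTF-8 error
// categories reported above (bad lead unit, not enough units, bad trailing
// unit, bad code point, not shortest form) can be told apart roughly as
// follows.  |Utf8Result| and |decodeNonAscii| are hypothetical names, and the
// order of the checks is an assumption rather than a statement about the
// real decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

enum class Utf8Result {
  Ok,
  BadLeadUnit,
  NotEnoughUnits,
  BadTrailingUnit,
  BadCodePoint,
  NotShortestForm,
};

// Decode one code point from |units[0..length)|.  The caller is assumed to
// have handled ASCII already, so |units[0]| is a non-ASCII lead unit.
inline Utf8Result decodeNonAscii(const std::uint8_t* units, std::size_t length,
                                 char32_t* codePoint) {
  assert(length >= 1);
  const std::uint8_t lead = units[0];
  std::size_t required = 0;
  char32_t min = 0;
  char32_t cp = 0;
  if ((lead & 0xE0) == 0xC0) {
    required = 2, min = 0x80, cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    required = 3, min = 0x800, cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    required = 4, min = 0x10000, cp = lead & 0x07;
  } else {
    // 0b10xx'xxxx (a trailing unit) or 0b1111'1xxx (never valid).
    return Utf8Result::BadLeadUnit;
  }

  if (length < required) {
    return Utf8Result::NotEnoughUnits;
  }

  for (std::size_t i = 1; i < required; i++) {
    if ((units[i] & 0xC0) != 0x80) {
      return Utf8Result::BadTrailingUnit;
    }
    cp = (cp << 6) | (units[i] & 0x3F);
  }

  // Structurally valid, but the encoded value may still be forbidden.
  if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
    return Utf8Result::BadCodePoint;
  }
  if (cp < min) {
    return Utf8Result::NotShortestForm;
  }

  *codePoint = cp;
  return Utf8Result::Ok;
}

}  // namespace sketch
```

// With this sketch, the 0b1111'0111 0b1011'0101 0b0000'0000 example from the
// |internalEncodingError| comment is classified as a bad trailing unit at
// the third unit.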
   2394 
   2395 // TokenStream is the lexical scanner for JavaScript source text.
   2396 //
   2397 // It takes a buffer of Unit code units (char16_t for UTF-16 source, or
   2398 // mozilla::Utf8Unit for UTF-8 source, per the TokenStreamChars
   2399 // specializations above) and linearly scans it into |Token|s.
   2400 //
   2401 // Internally the class uses a four element circular buffer |tokens| of
   2402 // |Token|s. As an index for |tokens|, the member |cursor_| points to the
   2403 // current token. Calls to getToken() increase |cursor_| by one and return the
   2404 // new current token. If a TokenStream was just created, the current token is
   2405 // uninitialized. It's therefore important that one of the first four member
   2406 // functions listed below is called first. The circular buffer lets us go back
   2407 // up to two tokens from the last scanned token. Internally, the relative
   2408 // number of backward steps that were taken (via ungetToken()) after the last
   2409 // token was scanned is stored in |lookahead|.
   2410 //
   2411 // The following table lists in which situations it is safe to call each listed
   2412 // function. No checks are made by the functions in non-debug builds.
   2413 //
   2414 // Function Name     | Precondition; changes to |lookahead|
   2415 // ------------------+---------------------------------------------------------
   2416 // getToken          | none; if |lookahead > 0| then |lookahead--|
   2417 // peekToken         | none; if |lookahead == 0| then |lookahead == 1|
   2418 // peekTokenSameLine | none; if |lookahead == 0| then |lookahead == 1|
   2419 // matchToken        | none; if |lookahead > 0| and the match succeeds then
   2420 //                   |       |lookahead--|
   2421 // consumeKnownToken | none; if |lookahead > 0| then |lookahead--|
   2422 // ungetToken        | 0 <= |lookahead| <= |maxLookahead - 1|; |lookahead++|
   2423 //
   2424 // The behavior of the token scanning process (see getTokenInternal()) can be
   2425 // modified by calling one of the first four above listed member functions with
   2426 // an optional argument of type Modifier.  However, the modifier will be
   2427 // ignored unless |lookahead == 0| holds.  Due to constraints of the grammar,
   2428 // this turns out not to be a problem in practice. See the
   2429 // mozilla.dev.tech.js-engine.internals thread entitled 'Bug in the scanner?'
   2430 // for more details:
   2431 // https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.tech.js-engine.internals/2JLH5jRcr7E
   2432 //
   2433 // The method seekTo() allows rescanning from a previously visited location of
   2434 // the buffer, initially computed by constructing a Position local variable.
   2435 //
   2436 template <typename Unit, class AnyCharsAccess>
   2437 class MOZ_STACK_CLASS TokenStreamSpecific
   2438    : public TokenStreamChars<Unit, AnyCharsAccess>,
   2439      public TokenStreamShared,
   2440      public ErrorReporter {
   2441 public:
   2442  using CharsBase = TokenStreamCharsBase<Unit>;
   2443  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<Unit>;
   2444  using GeneralCharsBase = GeneralTokenStreamChars<Unit, AnyCharsAccess>;
   2445  using SpecializedChars = TokenStreamChars<Unit, AnyCharsAccess>;
   2446 
   2447  using Position = TokenStreamPosition<Unit>;
   2448 
   2449  // Anything inherited through a base class whose type depends upon this
   2450  // class's template parameters can only be accessed through a dependent
   2451  // name: prefixed with |this|, by explicit qualification, and so on.  (This
   2452  // is so that references to inherited fields are statically distinguishable
   2453  // from references to names outside of the class.)  This is tedious and
   2454  // onerous.
   2455  //
   2456  // As an alternative, we directly add every one of these functions to this
   2457  // class, using explicit qualification to address the dependent-name
   2458  // problem.  |this| or other qualification is no longer necessary -- at
   2459  // cost of this ever-changing laundry list of |using|s.  So it goes.
   2460 public:
   2461  using GeneralCharsBase::anyCharsAccess;
   2462  using GeneralCharsBase::computeLineAndColumn;
   2463  using TokenStreamCharsShared::adoptState;
   2464 
   2465 private:
   2466  using typename CharsBase::SourceUnits;
   2467 
   2468 private:
   2469  using CharsBase::atomizeSourceChars;
   2470  using GeneralCharsBase::badToken;
   2471  // Deliberately don't |using| |charBuffer| because of bug 1472569.  :-(
   2472  using CharsBase::consumeKnownCodeUnit;
   2473  using CharsBase::matchCodeUnit;
   2474  using CharsBase::matchLineTerminator;
   2475  using CharsBase::peekCodeUnit;
   2476  using GeneralCharsBase::computeColumn;
   2477  using GeneralCharsBase::fillExceptingContext;
   2478  using GeneralCharsBase::getCodeUnit;
   2479  using GeneralCharsBase::getFullAsciiCodePoint;
   2480  using GeneralCharsBase::internalComputeLineOfContext;
   2481  using GeneralCharsBase::matchUnicodeEscapeIdent;
   2482  using GeneralCharsBase::matchUnicodeEscapeIdStart;
   2483  using GeneralCharsBase::newAtomToken;
   2484  using GeneralCharsBase::newBigIntToken;
   2485  using GeneralCharsBase::newNameToken;
   2486  using GeneralCharsBase::newNumberToken;
   2487  using GeneralCharsBase::newPrivateNameToken;
   2488  using GeneralCharsBase::newRegExpToken;
   2489  using GeneralCharsBase::newSimpleToken;
   2490  using SpecializedChars::getNonAsciiCodePoint;
   2491  using SpecializedChars::getNonAsciiCodePointDontNormalize;
   2492  using TokenStreamCharsShared::copyCharBufferTo;
   2493  using TokenStreamCharsShared::drainCharBufferIntoAtom;
   2494  using TokenStreamCharsShared::isAsciiCodePoint;
   2495  // Deliberately don't |using| |sourceUnits| because of bug 1472569.  :-(
   2496  using CharsBase::toUnit;
   2497  using GeneralCharsBase::ungetCodeUnit;
   2498  using GeneralCharsBase::updateLineInfoForEOL;
   2499 
   2500  template <typename CharU>
   2501  friend class TokenStreamPosition;
   2502 
   2503 public:
   2504  TokenStreamSpecific(FrontendContext* fc, ParserAtomsTable* parserAtoms,
   2505                      const JS::ReadOnlyCompileOptions& options,
   2506                      const Unit* units, size_t length);
   2507 
   2508  /**
   2509   * Get the next code point, converting LineTerminatorSequences to '\n' and
   2510   * updating internal line-counter state if needed. Return true on success.
   2511   * Return false on failure.
   2512   */
   2513  [[nodiscard]] MOZ_ALWAYS_INLINE bool getCodePoint() {
   2514    int32_t unit = getCodeUnit();
   2515    if (MOZ_UNLIKELY(unit == EOF)) {
   2516      MOZ_ASSERT(anyCharsAccess().flags.isEOF,
   2517                 "flags.isEOF should have been set by getCodeUnit()");
   2518      return true;
   2519    }
   2520 
   2521    if (isAsciiCodePoint(unit)) {
   2522      return getFullAsciiCodePoint(unit);
   2523    }
   2524 
   2525    char32_t cp;
   2526    return getNonAsciiCodePoint(unit, &cp);
   2527  }
   2528 
   2529  // If there is an invalid escape in a template, report it and return false,
   2530  // otherwise return true.
   2531  bool checkForInvalidTemplateEscapeError() {
   2532    if (anyCharsAccess().invalidTemplateEscapeType == InvalidEscapeType::None) {
   2533      return true;
   2534    }
   2535 
   2536    reportInvalidEscapeError(anyCharsAccess().invalidTemplateEscapeOffset,
   2537                             anyCharsAccess().invalidTemplateEscapeType);
   2538    return false;
   2539  }
   2540 
   2541 public:
   2542  // Implement ErrorReporter.
   2543 
   2544  std::optional<bool> isOnThisLine(size_t offset,
   2545                                   uint32_t lineNum) const final {
   2546    return anyCharsAccess().srcCoords.isOnThisLine(offset, lineNum);
   2547  }
   2548 
   2549  uint32_t lineAt(size_t offset) const final {
   2550    const auto& anyChars = anyCharsAccess();
   2551    auto lineToken = anyChars.lineToken(offset);
   2552    return anyChars.lineNumber(lineToken);
   2553  }
   2554 
   2555  JS::LimitedColumnNumberOneOrigin columnAt(size_t offset) const final {
   2556    return computeColumn(anyCharsAccess().lineToken(offset), offset);
   2557  }
   2558 
   2559 private:
   2560  // Implement ErrorReportMixin.
   2561 
   2562  FrontendContext* getContext() const override {
   2563    return anyCharsAccess().context();
   2564  }
   2565 
   2566  [[nodiscard]] bool strictMode() const override {
   2567    return anyCharsAccess().strictMode();
   2568  }
   2569 
   2570 public:
   2571  // Implement ErrorReportMixin.
   2572 
   2573  const JS::ReadOnlyCompileOptions& options() const final {
   2574    return anyCharsAccess().options();
   2575  }
   2576 
   2577  [[nodiscard]] bool computeErrorMetadata(
   2578      ErrorMetadata* err, const ErrorOffset& errorOffset) const override;
   2579 
   2580 private:
   2581  void reportInvalidEscapeError(uint32_t offset, InvalidEscapeType type) {
   2582    switch (type) {
   2583      case InvalidEscapeType::None:
   2584        MOZ_ASSERT_UNREACHABLE("unexpected InvalidEscapeType");
   2585        return;
   2586      case InvalidEscapeType::Hexadecimal:
   2587        errorAt(offset, JSMSG_MALFORMED_ESCAPE, "hexadecimal");
   2588        return;
   2589      case InvalidEscapeType::Unicode:
   2590        errorAt(offset, JSMSG_MALFORMED_ESCAPE, "Unicode");
   2591        return;
   2592      case InvalidEscapeType::UnicodeOverflow:
   2593        errorAt(offset, JSMSG_UNICODE_OVERFLOW, "escape sequence");
   2594        return;
   2595      case InvalidEscapeType::Octal:
   2596        errorAt(offset, JSMSG_DEPRECATED_OCTAL_ESCAPE);
   2597        return;
   2598      case InvalidEscapeType::EightOrNine:
   2599        errorAt(offset, JSMSG_DEPRECATED_EIGHT_OR_NINE_ESCAPE);
   2600        return;
   2601    }
   2602  }
   2603 
   2604  void reportIllegalCharacter(int32_t cp);
   2605 
   2606  [[nodiscard]] bool putIdentInCharBuffer(const Unit* identStart);
   2607 
   2608  using IsIntegerUnit = bool (*)(int32_t);
   2609  [[nodiscard]] MOZ_ALWAYS_INLINE bool matchInteger(IsIntegerUnit isIntegerUnit,
   2610                                                    int32_t* nextUnit);
   2611  [[nodiscard]] MOZ_ALWAYS_INLINE bool matchIntegerAfterFirstDigit(
   2612      IsIntegerUnit isIntegerUnit, int32_t* nextUnit);
   2613 
   2614  /**
   2615   * Tokenize a decimal number that begins at |numStart| into the provided
   2616   * token.
   2617   *
   2618   * |unit| must be one of these values:
   2619   *
   2620   *   1. The first decimal digit in the integral part of a decimal number
   2621   *      not starting with '.', e.g. '1' for "17", '0' for "0.14", or
   2622   *      '8' for "8.675309e6".
   2623   *
   2624   *   In this case, the next |getCodeUnit()| must return the code unit after
   2625   *   |unit| in the overall number.
   2626   *
   2627   *   2. The '.' in a "."-prefixed decimal number, e.g. ".17" or ".1e3".
   2628   *
   2629   *   In this case, the next |getCodeUnit()| must return the code unit
   2630   *   *after* the '.'.
   2631   *
   2632   *   3. (Non-strict mode code only)  The first non-ASCII-digit unit for a
   2633   *      "noctal" number that begins with a '0' but contains a non-octal digit
   2634   *      in its integer part so is interpreted as decimal, e.g. '.' in "09.28"
   2635   *      or EOF for "0386" or '+' in "09+7" (three separate tokens).
   2636   *
   2637   *   In this case, the next |getCodeUnit()| returns the code unit after
   2638   *   |unit|: '2', 'EOF', or '7' in the examples above.
   2639   *
   2640   * This interface is super-hairy and horribly stateful.  Unfortunately, its
   2641   * hair merely reflects the intricacy of ECMAScript numeric literal syntax.
   2642   * And incredibly, it *improves* on the goto-based horror that predated it.
   2643   */
   2644  [[nodiscard]] bool decimalNumber(int32_t unit, TokenStart start,
   2645                                   const Unit* numStart, Modifier modifier,
   2646                                   TokenKind* out);
   2647 
   2648  /** Tokenize a regular expression literal beginning at |start|. */
   2649  [[nodiscard]] bool regexpLiteral(TokenStart start, TokenKind* out);
   2650 
   2651  /**
   2652   * Slurp characters between |start| and sourceUnits.current() into
   2653   * charBuffer, to later parse into a bigint.
   2654   */
   2655  [[nodiscard]] bool bigIntLiteral(TokenStart start, Modifier modifier,
   2656                                   TokenKind* out);
   2657 
   2658 public:
   2659  // Advance to the next token.  If the token stream encountered an error,
   2660  // return false.  Otherwise return true and store the token kind in |*ttp|.
   2661  [[nodiscard]] bool getToken(TokenKind* ttp, Modifier modifier = SlashIsDiv) {
   2662    // Check for a pushed-back token resulting from mismatching lookahead.
   2663    TokenStreamAnyChars& anyChars = anyCharsAccess();
   2664    if (anyChars.lookahead != 0) {
   2665      MOZ_ASSERT(!anyChars.flags.hadError);
   2666      anyChars.lookahead--;
   2667      anyChars.advanceCursor();
   2668      TokenKind tt = anyChars.currentToken().type;
   2669      MOZ_ASSERT(tt != TokenKind::Eol);
   2670      verifyConsistentModifier(modifier, anyChars.currentToken());
   2671      *ttp = tt;
   2672      return true;
   2673    }
   2674 
   2675    return getTokenInternal(ttp, modifier);
   2676  }
   2677 
   2678  [[nodiscard]] bool peekToken(TokenKind* ttp, Modifier modifier = SlashIsDiv) {
   2679    TokenStreamAnyChars& anyChars = anyCharsAccess();
   2680    if (anyChars.lookahead > 0) {
   2681      MOZ_ASSERT(!anyChars.flags.hadError);
   2682      verifyConsistentModifier(modifier, anyChars.nextToken());
   2683      *ttp = anyChars.nextToken().type;
   2684      return true;
   2685    }
   2686    if (!getTokenInternal(ttp, modifier)) {
   2687      return false;
   2688    }
   2689    anyChars.ungetToken();
   2690    return true;
   2691  }
   2692 
   2693  [[nodiscard]] bool peekTokenPos(TokenPos* posp,
   2694                                  Modifier modifier = SlashIsDiv) {
   2695    TokenStreamAnyChars& anyChars = anyCharsAccess();
   2696    if (anyChars.lookahead == 0) {
   2697      TokenKind tt;
   2698      if (!getTokenInternal(&tt, modifier)) {
   2699        return false;
   2700      }
   2701      anyChars.ungetToken();
   2702      MOZ_ASSERT(anyChars.hasLookahead());
   2703    } else {
   2704      MOZ_ASSERT(!anyChars.flags.hadError);
   2705      verifyConsistentModifier(modifier, anyChars.nextToken());
   2706    }
   2707    *posp = anyChars.nextToken().pos;
   2708    return true;
   2709  }
   2710 
   2711  [[nodiscard]] bool peekOffset(uint32_t* offset,
   2712                                Modifier modifier = SlashIsDiv) {
   2713    TokenPos pos;
   2714    if (!peekTokenPos(&pos, modifier)) {
   2715      return false;
   2716    }
   2717    *offset = pos.begin;
   2718    return true;
   2719  }
   2720 
   2721  // This is like peekToken(), with one exception:  if there is an EOL
   2722  // between the end of the current token and the start of the next token, it
   2723  // returns true and stores Eol in |*ttp|.  In that case, no token with
   2724  // Eol is actually created, just an Eol TokenKind is returned, and
   2725  // currentToken() shouldn't be consulted.  (This is the only place Eol
   2726  // is produced.)
   2727  [[nodiscard]] MOZ_ALWAYS_INLINE bool peekTokenSameLine(
   2728      TokenKind* ttp, Modifier modifier = SlashIsDiv) {
   2729    TokenStreamAnyChars& anyChars = anyCharsAccess();
   2730    const Token& curr = anyChars.currentToken();
   2731 
   2732    // If lookahead != 0, we have scanned ahead at least one token, and
   2733    // |lineno| is the line that the furthest-scanned token ends on.  If
   2734    // it's the same as the line that the current token ends on, that's a
   2735    // stronger condition than what we are looking for, and we don't need
   2736    // to return Eol.
   2737    if (anyChars.lookahead != 0) {
   2738      std::optional<bool> onThisLineStatus =
   2739          anyChars.srcCoords.isOnThisLine(curr.pos.end, anyChars.lineno);
   2740      if (!onThisLineStatus.has_value()) {
   2741        error(JSMSG_OUT_OF_MEMORY);
   2742        return false;
   2743      }
   2744 
   2745      bool onThisLine = *onThisLineStatus;
   2746      if (onThisLine) {
   2747        MOZ_ASSERT(!anyChars.flags.hadError);
   2748        verifyConsistentModifier(modifier, anyChars.nextToken());
   2749        *ttp = anyChars.nextToken().type;
   2750        return true;
   2751      }
   2752    }
   2753 
   2754    // The above check misses two cases where we don't have to return
   2755    // Eol.
   2756    // - The next token starts on the same line, but is a multi-line token.
   2757    // - The next token starts on the same line, but lookahead==2 and there
   2758    //   is a newline between the next token and the one after that.
   2759    // The following test is somewhat expensive but gets these cases (and
   2760    // all others) right.
   2761    TokenKind tmp;
   2762    if (!getToken(&tmp, modifier)) {
   2763      return false;
   2764    }
   2765 
   2766    const Token& next = anyChars.currentToken();
   2767    anyChars.ungetToken();
   2768 
   2769    // Careful, |next| points to an initialized-but-not-allocated Token!
   2770    // This is safe because we don't modify token data below.
   2771 
   2772    auto currentEndToken = anyChars.lineToken(curr.pos.end);
   2773    auto nextBeginToken = anyChars.lineToken(next.pos.begin);
   2774 
   2775    *ttp =
   2776        currentEndToken.isSameLine(nextBeginToken) ? next.type : TokenKind::Eol;
   2777    return true;
   2778  }
   2779 
   2780  // Get the next token from the stream if its kind is |tt|.
   2781  [[nodiscard]] bool matchToken(bool* matchedp, TokenKind tt,
   2782                                Modifier modifier = SlashIsDiv) {
   2783    TokenKind token;
   2784    if (!getToken(&token, modifier)) {
   2785      return false;
   2786    }
   2787    if (token == tt) {
   2788      *matchedp = true;
   2789    } else {
   2790      anyCharsAccess().ungetToken();
   2791      *matchedp = false;
   2792    }
   2793    return true;
   2794  }
   2795 
   2796  void consumeKnownToken(TokenKind tt, Modifier modifier = SlashIsDiv) {
   2797    bool matched;
   2798    MOZ_ASSERT(anyCharsAccess().hasLookahead());
   2799    MOZ_ALWAYS_TRUE(matchToken(&matched, tt, modifier));
   2800    MOZ_ALWAYS_TRUE(matched);
   2801  }
   2802 
   2803  [[nodiscard]] bool nextTokenEndsExpr(bool* endsExpr) {
   2804    TokenKind tt;
   2805    if (!peekToken(&tt)) {
   2806      return false;
   2807    }
   2808 
   2809    *endsExpr = anyCharsAccess().isExprEnding[size_t(tt)];
   2810    if (*endsExpr) {
   2811      // If the next token ends an overall Expression, we'll parse this
   2812      // Expression without ever invoking Parser::orExpr().  But we need that
   2813      // function's DEBUG-only side effect of marking this token as safe to get
   2814      // with SlashIsRegExp, so we have to do it manually here.
   2815      anyCharsAccess().allowGettingNextTokenWithSlashIsRegExp();
   2816    }
   2817    return true;
   2818  }
   2819 
   2820  [[nodiscard]] bool advance(size_t position);
   2821 
   2822  void seekTo(const Position& pos);
   2823  [[nodiscard]] bool seekTo(const Position& pos,
   2824                            const TokenStreamAnyChars& other);
   2825 
   2826  void rewind(const Position& pos) {
   2827    MOZ_ASSERT(pos.buf <= this->sourceUnits.addressOfNextCodeUnit(),
   2828               "should be rewinding here");
   2829    seekTo(pos);
   2830  }
   2831 
   2832  [[nodiscard]] bool rewind(const Position& pos,
   2833                            const TokenStreamAnyChars& other) {
   2834    MOZ_ASSERT(pos.buf <= this->sourceUnits.addressOfNextCodeUnit(),
   2835               "should be rewinding here");
   2836    return seekTo(pos, other);
   2837  }
   2838 
   2839  void fastForward(const Position& pos) {
   2840    MOZ_ASSERT(this->sourceUnits.addressOfNextCodeUnit() <= pos.buf,
   2841               "should be moving forward here");
   2842    seekTo(pos);
   2843  }
   2844 
   2845  [[nodiscard]] bool fastForward(const Position& pos,
   2846                                 const TokenStreamAnyChars& other) {
   2847    MOZ_ASSERT(this->sourceUnits.addressOfNextCodeUnit() <= pos.buf,
   2848               "should be moving forward here");
   2849    return seekTo(pos, other);
   2850  }
   2851 
   2852  const Unit* codeUnitPtrAt(size_t offset) const {
   2853    return this->sourceUnits.codeUnitPtrAt(offset);
   2854  }
   2855 
   2856  [[nodiscard]] bool identifierName(TokenStart start, const Unit* identStart,
   2857                                    IdentifierEscapes escaping,
   2858                                    Modifier modifier,
   2859                                    NameVisibility visibility, TokenKind* out);
   2860 
   2861  [[nodiscard]] bool matchIdentifierStart(IdentifierEscapes* sawEscape);
   2862 
   2863  [[nodiscard]] bool getTokenInternal(TokenKind* const ttp,
   2864                                      const Modifier modifier);
   2865 
   2866  [[nodiscard]] bool getStringOrTemplateToken(char untilChar, Modifier modifier,
   2867                                              TokenKind* out);
   2868 
   2869  // Parse a TemplateMiddle or TemplateTail token (one of the string-like parts
   2870  // of a template string) after already consuming the leading `RightCurly`.
   2871  // (The spec says the `}` is the first character of the TemplateMiddle/
   2872  // TemplateTail, but we treat it as a separate token because that's much
   2873  // easier to implement in both TokenStream and the parser.)
   2874  //
   2875  // This consumes a token and sets the current token, like `getToken()`.  It
   2876  // doesn't take a Modifier because there's no risk of encountering a division
   2877  // operator or RegExp literal.
   2878  //
   2879  // On success, `*ttp` is either `TokenKind::TemplateHead` (if we got a
   2880  // TemplateMiddle token) or `TokenKind::NoSubsTemplate` (if we got a
   2881  // TemplateTail). That may seem strange; there are four different template
   2882  // token types in the spec, but we only use two. We use `TemplateHead` for
   2883  // TemplateMiddle because both end with `...${`, and `NoSubsTemplate` for
   2884  // TemplateTail because both contain the end of the template, including the
   2885  // closing quote mark. They're not treated differently, either in the parser
   2886  // or in the tokenizer.
   2887  [[nodiscard]] bool getTemplateToken(TokenKind* ttp) {
   2888    MOZ_ASSERT(anyCharsAccess().currentToken().type == TokenKind::RightCurly);
   2889    return getStringOrTemplateToken('`', SlashIsInvalid, ttp);
   2890  }
   2891 
   2892  [[nodiscard]] bool getDirectives(bool isMultiline, bool shouldWarnDeprecated);
   2893  [[nodiscard]] bool getDirective(
   2894      bool isMultiline, bool shouldWarnDeprecated, const char* directive,
   2895      uint8_t directiveLength, const char* errorMsgPragma,
   2896      UniquePtr<char16_t[], JS::FreePolicy>* destination);
   2897  [[nodiscard]] bool getDisplayURL(bool isMultiline, bool shouldWarnDeprecated);
   2898  [[nodiscard]] bool getSourceMappingURL(bool isMultiline,
   2899                                         bool shouldWarnDeprecated);
   2900 };
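
// Illustrative sketch, not part of TokenStream.h: the four-element circular
// token buffer and the |lookahead| bookkeeping described in the comment
// preceding TokenStreamSpecific can be modeled roughly as below.  |TokenRing|
// and its members are hypothetical names, plain integers stand in for
// |Token|s, and "scanning" just hands out consecutive integers.  Modifiers
// and error handling are left out.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

class TokenRing {
 public:
  static constexpr std::size_t ntokens = 4;  // must be a power of two
  static constexpr std::size_t ntokensMask = ntokens - 1;
  static constexpr std::size_t maxLookahead = 2;

  // Advance the cursor.  If tokens were pushed back, re-deliver the next
  // one; otherwise "scan" a fresh token into the new cursor slot.
  int getToken() {
    cursor_ = (cursor_ + 1) & ntokensMask;
    if (lookahead_ > 0) {
      lookahead_--;
    } else {
      tokens_[cursor_] = nextScanned_++;
    }
    return tokens_[cursor_];
  }

  // Step back one token.  Precondition: lookahead <= maxLookahead - 1.
  void ungetToken() {
    assert(lookahead_ < maxLookahead);
    lookahead_++;
    cursor_ = (cursor_ - 1) & ntokensMask;
  }

  // Look at the next token without consuming it.
  int peekToken() {
    int tok = getToken();
    ungetToken();
    return tok;
  }

  std::size_t lookahead() const { return lookahead_; }

 private:
  int tokens_[ntokens] = {};
  std::size_t cursor_ = 0;
  std::size_t lookahead_ = 0;
  int nextScanned_ = 1;
};

}  // namespace sketch
```

// In this model a peek on a fresh stream leaves |lookahead == 1|, and the
// following getToken() returns the peeked token and drops |lookahead| back
// to 0, matching the table in the class comment.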

// It's preferable to define this in TokenStream.cpp, but its template-ness
// means we'd then have to *instantiate* this constructor for all possible
// (Unit, AnyCharsAccess) pairs -- and that gets super-messy as AnyCharsAccess
// *itself* is templated.  This symbol really isn't that huge compared to some
// defined inline in TokenStreamSpecific, so just rely on the linker commoning
// stuff up.
template <typename Unit>
template <class AnyCharsAccess>
inline TokenStreamPosition<Unit>::TokenStreamPosition(
    TokenStreamSpecific<Unit, AnyCharsAccess>& tokenStream)
    : currentToken(tokenStream.anyCharsAccess().currentToken()) {
  TokenStreamAnyChars& anyChars = tokenStream.anyCharsAccess();

  buf =
      tokenStream.sourceUnits.addressOfNextCodeUnit(/* allowPoisoned = */ true);
  flags = anyChars.flags;
  lineno = anyChars.lineno;
  linebase = anyChars.linebase;
  prevLinebase = anyChars.prevLinebase;
  lookahead = anyChars.lookahead;
  currentToken = anyChars.currentToken();
  for (unsigned i = 0; i < anyChars.lookahead; i++) {
    lookaheadTokens[i] = anyChars.tokens[anyChars.aheadCursor(1 + i)];
  }
}
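// Illustration only, not part of this header: a minimal sketch of how a
// position snapshot can copy the current token plus any lookahead tokens out
// of a circular token buffer, as the TokenStreamPosition constructor does.
// Every name here (`SketchToken`, `SketchStream`, `SketchPosition`) is
// hypothetical. The ring size of 4, the lookahead limit of 2, and the
// mask-based `aheadCursor` are assumptions made for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct SketchToken {
  int kind = 0;
};

// Hypothetical miniature of the shared stream state: a small ring of tokens,
// a cursor naming the current token, and a count of tokens already scanned
// beyond it.
struct SketchStream {
  static constexpr std::size_t kRingSize = 4;  // Assumed power of two.
  static constexpr std::size_t kRingMask = kRingSize - 1;
  static constexpr std::size_t kMaxLookahead = 2;  // Assumed limit.

  std::array<SketchToken, kRingSize> tokens{};
  std::size_t cursor = 0;
  std::size_t lookahead = 0;

  // Index of the token `steps` positions after the current one, wrapping
  // around the ring.
  std::size_t aheadCursor(std::size_t steps) const {
    return (cursor + steps) & kRingMask;
  }
  const SketchToken& currentToken() const { return tokens[cursor]; }
};

// Snapshot that can outlive later mutations of the ring.
struct SketchPosition {
  SketchToken currentToken;
  std::size_t lookahead = 0;
  std::array<SketchToken, SketchStream::kMaxLookahead> lookaheadTokens{};

  explicit SketchPosition(const SketchStream& stream)
      : currentToken(stream.currentToken()), lookahead(stream.lookahead) {
    assert(lookahead <= SketchStream::kMaxLookahead);
    // Copy the lookahead tokens in order, starting one past the cursor.
    for (std::size_t i = 0; i < lookahead; i++) {
      lookaheadTokens[i] = stream.tokens[stream.aheadCursor(1 + i)];
    }
  }
};
```

// Copying the tokens by value, not by ring index, is the point of the sketch:
// the ring slots may be overwritten once scanning resumes, so a saved position
// has to carry its own copies.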

class TokenStreamAnyCharsAccess {
 public:
  template <class TokenStreamSpecific>
  static inline TokenStreamAnyChars& anyChars(TokenStreamSpecific* tss);

  template <class TokenStreamSpecific>
  static inline const TokenStreamAnyChars& anyChars(
      const TokenStreamSpecific* tss);
};

class MOZ_STACK_CLASS TokenStream
    : public TokenStreamAnyChars,
      public TokenStreamSpecific<char16_t, TokenStreamAnyCharsAccess> {
  using Unit = char16_t;

 public:
  TokenStream(FrontendContext* fc, ParserAtomsTable* parserAtoms,
              const JS::ReadOnlyCompileOptions& options, const Unit* units,
              size_t length, StrictModeGetter* smg)
      : TokenStreamAnyChars(fc, options, smg),
        TokenStreamSpecific<Unit, TokenStreamAnyCharsAccess>(
            fc, parserAtoms, options, units, length) {}
};

class MOZ_STACK_CLASS DummyTokenStream final : public TokenStream {
 public:
  DummyTokenStream(FrontendContext* fc,
                   const JS::ReadOnlyCompileOptions& options)
      : TokenStream(fc, nullptr, options, nullptr, 0, nullptr) {}
};

template <class TokenStreamSpecific>
/* static */ inline TokenStreamAnyChars& TokenStreamAnyCharsAccess::anyChars(
    TokenStreamSpecific* tss) {
  auto* ts = static_cast<TokenStream*>(tss);
  return *static_cast<TokenStreamAnyChars*>(ts);
}

template <class TokenStreamSpecific>
/* static */ inline const TokenStreamAnyChars&
TokenStreamAnyCharsAccess::anyChars(const TokenStreamSpecific* tss) {
  const auto* ts = static_cast<const TokenStream*>(tss);
  return *static_cast<const TokenStreamAnyChars*>(ts);
}
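// Illustration only, not part of this header: a minimal sketch of the
// sibling-base access pattern that TokenStreamAnyCharsAccess::anyChars uses.
// One base class reaches the other by casting down to the class that inherits
// from both and then back up. All names are hypothetical: `SharedState` plays
// the role of TokenStreamAnyChars, `Specific` of TokenStreamSpecific,
// `Combined` of TokenStream, and `CombinedAccess` of the access policy.

```cpp
#include <cassert>

struct SharedState {
  int lineno = 1;
};

// A base class that needs the shared state but does not own it. The Access
// policy knows how to find it.
template <class Access>
class Specific {
 public:
  SharedState& shared() { return Access::shared(this); }
  void advanceLine() { shared().lineno++; }
};

// Declared before Combined exists; defined only once Combined is complete,
// because the casts below need the full inheritance layout.
struct CombinedAccess {
  template <class SpecificT>
  static inline SharedState& shared(SpecificT* specific);
};

class Combined : public SharedState, public Specific<CombinedAccess> {};

template <class SpecificT>
inline SharedState& CombinedAccess::shared(SpecificT* specific) {
  // Downcast to the class that derives from both bases, then upcast to the
  // sibling base. This is only valid if `specific` really is the Specific
  // subobject of a Combined.
  auto* combined = static_cast<Combined*>(specific);
  return *static_cast<SharedState*>(combined);
}

// Small self-check showing the intended use.
inline void checkSiblingAccess() {
  Combined c;
  c.advanceLine();
  assert(c.lineno == 2);
}
```

// The sketch leans on the same precondition as the real code: the pointer
// handed to the access policy must belong to an object of the combined class,
// otherwise the downcast has undefined behavior.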

extern const char* TokenKindToDesc(TokenKind tt);

}  // namespace frontend
}  // namespace js

#ifdef DEBUG
extern const char* TokenKindToString(js::frontend::TokenKind tt);
#endif

#endif /* frontend_TokenStream_h */