TokenStream.h (109642B)
/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
 * vim: set ts=8 sts=2 et sw=2 tw=80:
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

/*
 * Streaming access to the raw tokens of JavaScript source.
 *
 * Because JS tokenization is context-sensitive -- a '/' could be either a
 * regular expression *or* a division operator depending on context -- the
 * various token stream classes are mostly not useful outside of the Parser
 * where they reside. We should probably eventually merge the two concepts.
 */
#ifndef frontend_TokenStream_h
#define frontend_TokenStream_h

/*
 * [SMDOC] Parser Token Stream
 *
 * A token stream exposes the raw tokens -- operators, names, numbers,
 * keywords, and so on -- of JavaScript source code.
 *
 * These are the components of the overall token stream concept:
 * TokenStreamShared, TokenStreamAnyChars, TokenStreamCharsBase<Unit>,
 * TokenStreamChars<Unit>, and TokenStreamSpecific<Unit, AnyCharsAccess>.
 *
 * == TokenStreamShared → ∅ ==
 *
 * Certain aspects of tokenizing are used everywhere:
 *
 *   * modifiers (used to select which context-sensitive interpretation of a
 *     character should be used to decide what token it is) and modifier
 *     assertion handling;
 *   * flags on the overall stream (have we encountered any characters on this
 *     line? have we hit a syntax error? and so on);
 *   * and certain token-count constants.
 *
 * These are all defined in TokenStreamShared. (They could be namespace-
 * scoped, but it seems tentatively better not to clutter the namespace.)
 *
 * == TokenStreamAnyChars → TokenStreamShared ==
 *
 * Certain aspects of tokenizing have meaning independent of the character type
 * of the source text being tokenized: line/column number information, tokens
 * in lookahead from determining the meaning of a prior token, compilation
 * options, the filename, flags, source map URL, access to details of the
 * current and next tokens (is the token of the given type? what name or
 * number is contained in the token? and other queries), and others.
 *
 * All this data/functionality *could* be duplicated for both single-byte and
 * double-byte tokenizing, but there are two problems. First, it's potentially
 * wasteful if the compiler doesn't recognize it can unify the concepts. (And
 * if any-character concepts are intermixed with character-specific concepts,
 * potentially the compiler *can't* unify them because offsets into the
 * hypothetical TokenStream<Unit>s would differ.) Second, some of this stuff
 * needs to be accessible in ParserBase, the aspects of JS language parsing
 * that have meaning independent of the character type of the source text being
 * parsed. So we need a separate data structure that ParserBase can hold on to
 * for it. (ParserBase isn't the only instance of this, but it's certainly the
 * biggest case of it.) Ergo, TokenStreamAnyChars.
 *
 * == TokenStreamCharsShared → ∅ ==
 *
 * Some functionality has meaning independent of character type, yet has no use
 * *unless* you know the character type in actual use. It *could* live in
 * TokenStreamAnyChars, but it makes more sense to live in a separate class
 * that character-aware token information can simply inherit.
 *
 * This class currently exists only to contain a char16_t buffer, transiently
 * used to accumulate strings in tricky cases that can't just be read directly
 * from source text.
 * It's not used outside character-aware tokenizing, so it
 * doesn't make sense in TokenStreamAnyChars.
 *
 * == TokenStreamCharsBase<Unit> → TokenStreamCharsShared ==
 *
 * Certain data structures in tokenizing are character-type-specific: namely,
 * the various pointers identifying the source text (including current offset
 * and end).
 *
 * Additionally, some functions operating on this data are defined the same way
 * no matter what character type you have (e.g. current offset in code units
 * into the source text) or share a common interface regardless of character
 * type (e.g. consume the next code unit if it has a given value).
 *
 * All such functionality lives in TokenStreamCharsBase<Unit>.
 *
 * == SpecializedTokenStreamCharsBase<Unit> → TokenStreamCharsBase<Unit> ==
 *
 * Certain tokenizing functionality is specific to a single character type.
 * For example, JS's UTF-16 encoding recognizes no coding errors, because lone
 * surrogates are not an error; but a UTF-8 encoding must recognize a variety
 * of validation errors. Such functionality is defined only in the appropriate
 * SpecializedTokenStreamCharsBase specialization.
 *
 * == GeneralTokenStreamChars<Unit, AnyCharsAccess> →
 *    SpecializedTokenStreamCharsBase<Unit> ==
 *
 * Some functionality operates differently on different character types, just
 * as for TokenStreamCharsBase, but additionally requires access to character-
 * type-agnostic information in TokenStreamAnyChars. For example, getting the
 * next character performs different steps for different character types and
 * must access TokenStreamAnyChars to update line break information.
 *
 * Such functionality, if it can be defined using the same algorithm for all
 * character types, lives in GeneralTokenStreamChars<Unit, AnyCharsAccess>.
 * The AnyCharsAccess parameter provides a way for a GeneralTokenStreamChars
 * instance to access its corresponding TokenStreamAnyChars, without inheriting
 * from it.
 *
 * GeneralTokenStreamChars<Unit, AnyCharsAccess> is just functionality, no
 * actual member data.
 *
 * Such functionality all lives in TokenStreamChars<Unit, AnyCharsAccess>, a
 * declared-but-not-defined template class whose specializations have a common
 * public interface (plus whatever private helper functions are desirable).
 *
 * == TokenStreamChars<Unit, AnyCharsAccess> →
 *    GeneralTokenStreamChars<Unit, AnyCharsAccess> ==
 *
 * Some functionality is like that in GeneralTokenStreamChars, *but* it's
 * defined entirely differently for different character types.
 *
 * For example, consider "match a multi-code unit code point" (hypothetically:
 * we've only implemented two-byte tokenizing right now):
 *
 *   * For two-byte text, there must be two code units to get, the leading code
 *     unit must be a UTF-16 lead surrogate, and the trailing code unit must be
 *     a UTF-16 trailing surrogate. (If any of these fail to hold, a next code
 *     unit encodes that code point and is not multi-code unit.)
 *   * For single-byte Latin-1 text, there are no multi-code unit code points.
 *   * For single-byte UTF-8 text, the first code unit must have N > 1 of its
 *     highest bits set (and the next unset), and |N - 1| successive code units
 *     must have their high bit set and next-highest bit unset, *and*
 *     concatenating all unconstrained bits together must not produce a code
 *     point value that could have been encoded in fewer code units.
 *
 * This functionality can't be implemented as member functions in
 * GeneralTokenStreamChars because we'd need to *partially specialize* those
 * functions -- hold Unit constant while letting AnyCharsAccess vary.
 * But C++ forbids function template partial specialization like this: either
 * you fix *all* parameters or you fix none of them.
 *
 * Fortunately, C++ *does* allow *class* template partial specialization. So
 * TokenStreamChars is a template class with one specialization per Unit.
 * Functions can be defined differently in the different specializations,
 * because AnyCharsAccess as the only template parameter on member functions
 * *can* vary.
 *
 * All TokenStreamChars<Unit, AnyCharsAccess> specializations, one per Unit,
 * are just functionality, no actual member data.
 *
 * == TokenStreamSpecific<Unit, AnyCharsAccess> →
 *    TokenStreamChars<Unit, AnyCharsAccess>, TokenStreamShared,
 *    ErrorReporter ==
 *
 * TokenStreamSpecific contains operations that are parametrized on character
 * type but implement the *general* idea of tokenizing, without being
 * intrinsically tied to character type. Notably, this includes all operations
 * that can report warnings or errors at particular offsets, because we include
 * a line of context with such errors -- and that necessarily accesses the raw
 * characters of their specific type.
 *
 * Much TokenStreamSpecific operation depends on functionality in
 * TokenStreamAnyChars. The obvious solution is to inherit it -- but this
 * doesn't work in Parser: its ParserBase base class needs some
 * TokenStreamAnyChars functionality without knowing character type.
 *
 * The AnyCharsAccess type parameter is a class that statically converts from a
 * TokenStreamSpecific* to its corresponding TokenStreamAnyChars. The
 * TokenStreamSpecific in Parser<ParseHandler, Unit> can then specify a class
 * that properly converts from TokenStreamSpecific Parser::tokenStream to
 * TokenStreamAnyChars ParserBase::anyChars.
 *
 * Could we hardcode one set of offset calculations for this and eliminate
 * AnyCharsAccess? No.
 * Offset calculations possibly could be hardcoded if
 * TokenStreamSpecific were present in Parser before Parser::handler, assuring
 * the same offsets in all Parser-related cases. But there's still a separate
 * TokenStream class, that requires different offset calculations. So even if
 * we wanted to hardcode this (it's not clear we would, because forcing the
 * TokenStreamSpecific declarer to specify this is more explicit), we couldn't.
 */

#include "mozilla/ArrayUtils.h"
#include "mozilla/Assertions.h"
#include "mozilla/Attributes.h"
#include "mozilla/Casting.h"
#include "mozilla/Maybe.h"
#include "mozilla/MemoryChecking.h"
#include "mozilla/Span.h"
#include "mozilla/TextUtils.h"
#include "mozilla/Utf8.h"

#include <algorithm>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <type_traits>

#include "jspubtd.h"

#include "frontend/ErrorReporter.h"
#include "frontend/ParserAtom.h"  // ParserAtom, ParserAtomsTable, TaggedParserAtomIndex
#include "frontend/Token.h"
#include "frontend/TokenKind.h"
#include "js/CharacterEncoding.h"  // JS::ConstUTF8CharsZ
#include "js/ColumnNumber.h"  // JS::LimitedColumnNumberOneOrigin, JS::ColumnNumberOneOrigin, JS::ColumnNumberUnsignedOffset
#include "js/CompileOptions.h"
#include "js/friend/ErrorMessages.h"  // JSMSG_*
#include "js/HashTable.h"             // js::HashMap
#include "js/RegExpFlags.h"           // JS::RegExpFlags
#include "js/UniquePtr.h"
#include "js/Vector.h"
#include "util/Unicode.h"
#include "vm/ErrorReporting.h"

struct KeywordInfo;

namespace js {

class FrontendContext;

namespace frontend {

// True if `atom` is a keyword.
bool IsKeyword(TaggedParserAtomIndex atom);

// If `name` is a reserved word, returns the TokenKind of it.
// TokenKind::Limit otherwise.
extern TokenKind ReservedWordTokenKind(TaggedParserAtomIndex name);

// If `name` is a reserved word, returns string representation of it.
// nullptr otherwise.
extern const char* ReservedWordToCharZ(TaggedParserAtomIndex name);

// If `tt` is a reserved word, returns string representation of it.
// nullptr otherwise.
extern const char* ReservedWordToCharZ(TokenKind tt);

enum class DeprecatedContent : uint8_t {
  // No deprecated content was present.
  None = 0,
  // Octal literal not prefixed by "0o" but rather by just "0", e.g. 0755.
  OctalLiteral,
  // Octal character escape, e.g. "hell\157 world".
  OctalEscape,
  // NonOctalDecimalEscape, i.e. "\8" or "\9".
  EightOrNineEscape,
};

struct TokenStreamFlags {
  // Hit end of file.
  bool isEOF : 1;
  // Non-whitespace since start of line.
  bool isDirtyLine : 1;
  // Hit a syntax error, at start or during a token.
  bool hadError : 1;

  // The nature of any deprecated content seen since last reset.
  // We have to use uint8_t instead of DeprecatedContent to work around a
  // GCC 7 bug.
  // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61414
  uint8_t sawDeprecatedContent : 2;

  TokenStreamFlags()
      : isEOF(false),
        isDirtyLine(false),
        hadError(false),
        sawDeprecatedContent(uint8_t(DeprecatedContent::None)) {}
};

template <typename Unit>
class TokenStreamPosition;

/**
 * TokenStream types and constants that are used in both TokenStreamAnyChars
 * and TokenStreamSpecific. Do not add any non-static data members to this
 * class!
 */
class TokenStreamShared {
 protected:
  // 1 current + (3 lookahead if EXPLICIT_RESOURCE_MANAGEMENT is enabled,
  // else 2 lookahead), rounded up to a power of 2.
  // NOTE: This must be a power of 2, in order to make `ntokensMask` work.
  static constexpr size_t ntokens = 4;

  static constexpr unsigned ntokensMask = ntokens - 1;

  template <typename Unit>
  friend class TokenStreamPosition;

 public:
#ifdef ENABLE_EXPLICIT_RESOURCE_MANAGEMENT
  // We need a lookahead buffer of at least 3 for the AwaitUsing syntax.
  static constexpr unsigned maxLookahead = 3;
#else
  static constexpr unsigned maxLookahead = 2;
#endif

  using Modifier = Token::Modifier;
  static constexpr Modifier SlashIsDiv = Token::SlashIsDiv;
  static constexpr Modifier SlashIsRegExp = Token::SlashIsRegExp;
  static constexpr Modifier SlashIsInvalid = Token::SlashIsInvalid;

  static void verifyConsistentModifier(Modifier modifier,
                                       const Token& nextToken) {
    MOZ_ASSERT(
        modifier == nextToken.modifier || modifier == SlashIsInvalid,
        "This token was scanned with both SlashIsRegExp and SlashIsDiv, "
        "indicating the parser is confused about how to handle a slash here. "
        "See comment at Token::Modifier.");
  }
};

static_assert(std::is_empty_v<TokenStreamShared>,
              "TokenStreamShared shouldn't bloat classes that inherit from it");

template <typename Unit, class AnyCharsAccess>
class TokenStreamSpecific;

template <typename Unit>
class MOZ_STACK_CLASS TokenStreamPosition final {
 public:
  template <class AnyCharsAccess>
  inline explicit TokenStreamPosition(
      TokenStreamSpecific<Unit, AnyCharsAccess>& tokenStream);

 private:
  TokenStreamPosition(const TokenStreamPosition&) = delete;

  // Technically only TokenStreamSpecific<Unit, AnyCharsAccess>::seek with
  // Unit constant and AnyCharsAccess varying must be friended, but 1) it's
  // hard to friend one function in template classes, and 2) C++ doesn't
  // allow partial friend specialization to target just that single class.
  template <typename Char, class AnyCharsAccess>
  friend class TokenStreamSpecific;

  const Unit* buf;
  TokenStreamFlags flags;
  unsigned lineno;
  size_t linebase;
  size_t prevLinebase;
  Token currentToken;
  unsigned lookahead;
  Token lookaheadTokens[TokenStreamShared::maxLookahead];
};

template <typename Unit>
class SourceUnits;

/**
 * This class maps:
 *
 *   * a sourceUnits offset (a 0-indexed count of code units)
 *
 * to
 *
 *   * a (1-indexed) line number and
 *   * a (0-indexed) offset in code *units* (not code points, not bytes) into
 *     that line,
 *
 * for either |Unit = Utf8Unit| or |Unit = char16_t|.
 *
 * Note that, if |Unit = Utf8Unit|, the latter quantity is *not* the same as a
 * column number, which is a count of UTF-16 code units. Computing a column
 * number requires the offset within the line and the source units of that line
 * (including what type |Unit| is, to know how to decode them). If you need a
 * column number, functions in |GeneralTokenStreamChars<Unit>| will consult
 * this and source units to compute it.
 */
class SourceCoords {
  // For a given buffer holding source code, |lineStartOffsets_| has one
  // element per line of source code, plus one sentinel element. Each
  // non-sentinel element holds the buffer offset for the start of the
  // corresponding line of source code. For this example script,
  // assuming an initialLineOffset of 0:
  //
  // 1  // xyz            [line starts at offset 0]
  // 2  var x;            [line starts at offset 7]
  // 3                    [line starts at offset 14]
  // 4  var y;            [line starts at offset 15]
  //
  // |lineStartOffsets_| is:
  //
  //   [0, 7, 14, 15, MAX_PTR]
  //
  // To convert a "line number" to an "index" into |lineStartOffsets_|,
  // subtract |initialLineNum_|. E.g. line 3's index is
  // (3 - initialLineNum_), which is 2.
  // Therefore lineStartOffsets_[2] holds the buffer offset for the start
  // of line 3, which is 14. (Note that |initialLineNum_| is often 1, but
  // not always.)
  //
  // The first element is always initialLineOffset, passed to the
  // constructor, and the last element is always the MAX_PTR sentinel.
  //
  // Offset-to-{line,offset-into-line} lookups are O(log n) in the worst
  // case (binary search), but in practice they're heavily clustered and
  // we do better than that by using the previous lookup's result
  // (lastIndex_) as a starting point.
  //
  // Checking if an offset lies within a particular line number
  // (isOnThisLine()) is O(1).
  //
  Vector<uint32_t, 128> lineStartOffsets_;

  /** The line number on which the source text begins. */
  uint32_t initialLineNum_;

  /**
   * The index corresponding to the last offset lookup -- used so that if
   * offset lookups proceed in increasing order, and the offset appears
   * in the next couple lines from the last offset, we can avoid a full
   * binary-search.
   *
   * This is mutable because it's modified on every search, but that fact
   * isn't visible outside this class.
   */
  mutable uint32_t lastIndex_;

  uint32_t indexFromOffset(uint32_t offset) const;

  static const uint32_t MAX_PTR = UINT32_MAX;

  uint32_t lineNumberFromIndex(uint32_t index) const {
    return index + initialLineNum_;
  }

  uint32_t indexFromLineNumber(uint32_t lineNum) const {
    return lineNum - initialLineNum_;
  }

 public:
  SourceCoords(FrontendContext* fc, uint32_t initialLineNumber,
               uint32_t initialOffset);

  [[nodiscard]] bool add(uint32_t lineNum, uint32_t lineStartOffset);
  [[nodiscard]] bool fill(const SourceCoords& other);

  std::optional<bool> isOnThisLine(uint32_t offset, uint32_t lineNum) const {
    uint32_t index = indexFromLineNumber(lineNum);
    if (index + 1 >= lineStartOffsets_.length()) {  // +1 due to sentinel
      return std::nullopt;
    }
    return (lineStartOffsets_[index] <= offset &&
            offset < lineStartOffsets_[index + 1]);
  }

  /**
   * A token, computed for an offset in source text, that can be used to
   * access line number and line-offset information for that offset.
   *
   * LineToken *alone* exposes whether the corresponding offset is in the
   * first line of source (which may not be 1, depending on
   * |initialLineNumber|), and whether it's in the same line as
   * another LineToken.
   */
  class LineToken {
    uint32_t index;
#ifdef DEBUG
    uint32_t offset_;  // stored for consistency-of-use assertions
#endif

    friend class SourceCoords;

   public:
    LineToken(uint32_t index, uint32_t offset)
        : index(index)
#ifdef DEBUG
          ,
          offset_(offset)
#endif
    {
    }

    bool isFirstLine() const { return index == 0; }

    bool isSameLine(LineToken other) const { return index == other.index; }

    void assertConsistentOffset(uint32_t offset) const {
      MOZ_ASSERT(offset_ == offset);
    }
  };

  /**
   * Compute a token usable to access information about the line at the
   * given offset.
   *
   * The only information directly accessible in a token is whether it
   * corresponds to the first line of source text (which may not be line
   * 1, depending on the |initialLineNumber| value used to construct
   * this). Use |lineNumber(LineToken)| to compute the actual line
   * number (incorporating the contribution of |initialLineNumber|).
   */
  LineToken lineToken(uint32_t offset) const;

  /** Compute the line number for the given token. */
  uint32_t lineNumber(LineToken lineToken) const {
    return lineNumberFromIndex(lineToken.index);
  }

  /** Return the offset of the start of the line for |lineToken|. */
  uint32_t lineStart(LineToken lineToken) const {
    MOZ_ASSERT(lineToken.index + 1 < lineStartOffsets_.length(),
               "recorded line-start information must be available");
    return lineStartOffsets_[lineToken.index];
  }
};

enum class UnitsType : unsigned char {
  PossiblyMultiUnit = 0,
  GuaranteedSingleUnit = 1,
};

class ChunkInfo {
 private:
  // Column number offset in UTF-16 code units.
  // Store everything in |unsigned char|s so everything packs.
  unsigned char columnOffset_[sizeof(uint32_t)];
  unsigned char unitsType_;

 public:
  ChunkInfo(JS::ColumnNumberUnsignedOffset offset, UnitsType type)
      : unitsType_(static_cast<unsigned char>(type)) {
    memcpy(columnOffset_, offset.addressOfValueForTranscode(), sizeof(offset));
  }

  JS::ColumnNumberUnsignedOffset columnOffset() const {
    JS::ColumnNumberUnsignedOffset offset;
    memcpy(offset.addressOfValueForTranscode(), columnOffset_,
           sizeof(uint32_t));
    return offset;
  }

  UnitsType unitsType() const {
    MOZ_ASSERT(unitsType_ <= 1, "unitsType_ must be 0 or 1");
    return static_cast<UnitsType>(unitsType_);
  }

  void guaranteeSingleUnits() {
    MOZ_ASSERT(unitsType() == UnitsType::PossiblyMultiUnit,
               "should only be setting to possibly optimize from the "
               "pessimistic case");
    unitsType_ = static_cast<unsigned char>(UnitsType::GuaranteedSingleUnit);
  }
};

enum class InvalidEscapeType {
  // No invalid character escapes.
  None,
  // A malformed \x escape.
  Hexadecimal,
  // A malformed \u escape.
  Unicode,
  // An otherwise well-formed \u escape which represents a
  // codepoint > 10FFFF.
  UnicodeOverflow,
  // An octal escape in a template token.
  Octal,
  // NonOctalDecimalEscape - \8 or \9.
  EightOrNine
};

class TokenStreamAnyChars : public TokenStreamShared {
 private:
  // Constant-at-construction fields.

  FrontendContext* const fc;

  /** Options used for parsing/tokenizing. */
  const JS::ReadOnlyCompileOptions& options_;

  /**
   * Pointer used internally to test whether in strict mode. Use |strictMode()|
   * instead of this field.
   */
  StrictModeGetter* const strictModeGetter_;

  /** Input filename or null. */
  JS::ConstUTF8CharsZ filename_;

  // Column number computation fields.
  // Used only for UTF-8 case.

  /**
   * A map of (line number => sequence of the column numbers at
   * |ColumnChunkLength|-unit boundaries rewound [if needed] to the nearest code
   * point boundary). (|TokenStreamAnyChars::computeColumnOffset| is the sole
   * user of |ColumnChunkLength| and therefore contains its definition.)
   *
   * Entries appear in this map only when a column computation of sufficient
   * distance is performed on a line -- and only when the column is beyond the
   * first |ColumnChunkLength| units. Each line's vector is lazily filled as
   * greater offsets require column computations.
   */
  mutable HashMap<uint32_t, Vector<ChunkInfo>> longLineColumnInfo_;

  // Computing accurate column numbers requires at *some* point linearly
  // iterating through prior source units in the line, to properly account for
  // multi-unit code points. This is quadratic if counting happens repeatedly.
  //
  // But usually we need columns for advancing offsets through scripts. By
  // caching the last ((line number, offset) => relative column) mapping (in
  // similar manner to how |SourceCoords::lastIndex_| is used to cache
  // (offset => line number) mappings) we can usually avoid re-iterating through
  // the common line prefix.
  //
  // Additionally, we avoid hash table lookup costs by caching the
  // |Vector<ChunkInfo>*| for the line of the last lookup. (|nullptr| means we
  // must look it up -- or it hasn't been created yet.) This pointer is nulled
  // when a lookup on a new line occurs, but as it's not a pointer at literal,
  // reallocatable element data, it's *not* invalidated when new entries are
  // added to such a vector.

  /**
   * The line in which the last column computation occurred, or UINT32_MAX if
   * no prior computation has yet happened.
   */
  mutable uint32_t lineOfLastColumnComputation_ = UINT32_MAX;

  /**
   * The chunk vector of the line for that last column computation.
   * This is null if the chunk vector needs to be recalculated or initially
   * created.
   */
  mutable Vector<ChunkInfo>* lastChunkVectorForLine_ = nullptr;

  /**
   * The offset (in code units) of the last column computation performed,
   * relative to source start.
   */
  mutable uint32_t lastOffsetOfComputedColumn_ = UINT32_MAX;

  /**
   * The column number offset from the 1st column for the offset (in code units)
   * of the last column computation performed, relative to source start.
   */
  mutable JS::ColumnNumberUnsignedOffset lastComputedColumnOffset_;

  // Intra-token fields.

  /**
   * The offset of the first invalid escape in a template literal. (If there is
   * one -- if not, the value of this field is meaningless.)
   *
   * See also |invalidTemplateEscapeType|.
   */
  uint32_t invalidTemplateEscapeOffset = 0;

  /**
   * The type of the first invalid escape in a template literal. (If there
   * isn't one, this will be |None|.)
   *
   * See also |invalidTemplateEscapeOffset|.
   */
  InvalidEscapeType invalidTemplateEscapeType = InvalidEscapeType::None;

  // Fields with values relevant across tokens (and therefore potentially across
  // function boundaries, such that lazy function parsing and stream-seeking
  // must take care in saving and restoring them).

  /** Line number and offset-to-line mapping information. */
  SourceCoords srcCoords;

  /** Circular token buffer of gotten tokens that have been ungotten. */
  Token tokens[ntokens] = {};

  /** The index in |tokens| of the last parsed token. */
  unsigned cursor_ = 0;

  /** The number of tokens in |tokens| available to be gotten. */
  unsigned lookahead = 0;

  /** The current line number. */
  unsigned lineno;

  /** Various flag bits (see above). */
  TokenStreamFlags flags = {};

  /** The offset of the start of the current line.
   */
  size_t linebase = 0;

  /** The start of the previous line, or |size_t(-1)| on the first line. */
  size_t prevLinebase = size_t(-1);

  /** The user's requested source URL. Null if none has been set. */
  UniqueTwoByteChars displayURL_ = nullptr;

  /** The URL of the source map for this script. Null if none has been set. */
  UniqueTwoByteChars sourceMapURL_ = nullptr;

  // Assorted boolean fields, none of which require maintenance across tokens,
  // stored at class end to minimize padding.

  /**
   * Whether syntax errors should or should not contain details about the
   * precise nature of the error. (This is intended for use in suppressing
   * content-revealing details about syntax errors in cross-origin scripts on
   * the web.)
   */
  const bool mutedErrors;

  /**
   * An array storing whether a TokenKind observed while attempting to extend
   * a valid AssignmentExpression into an even longer AssignmentExpression
   * (e.g., extending '3' to '3 + 5') will terminate it without error.
   *
   * For example, ';' always ends an AssignmentExpression because it ends a
   * Statement or declaration. '}' always ends an AssignmentExpression
   * because it terminates BlockStatement, FunctionBody, and embedded
   * expressions in TemplateLiterals. Therefore both entries are set to true
   * in TokenStreamAnyChars construction.
   *
   * But e.g. '+' *could* extend an AssignmentExpression, so its entry here
   * is false. Meanwhile 'this' can't extend an AssignmentExpression, but
   * it's only valid after a line break, so its entry here must be false.
   *
   * NOTE: This array could be static, but without C99's designated
   *       initializers it's easier zeroing here and setting the true entries
   *       in the constructor body. (Having this per-instance might also aid
   *       locality.) Don't worry! Initialization time for each TokenStream
   *       is trivial. See bug 639420.
   */
  bool isExprEnding[size_t(TokenKind::Limit)] = {};  // all-false initially

  // End of fields.

 public:
  TokenStreamAnyChars(FrontendContext* fc,
                      const JS::ReadOnlyCompileOptions& options,
                      StrictModeGetter* smg);

  template <typename Unit, class AnyCharsAccess>
  friend class GeneralTokenStreamChars;
  template <typename Unit, class AnyCharsAccess>
  friend class TokenStreamChars;
  template <typename Unit, class AnyCharsAccess>
  friend class TokenStreamSpecific;

  template <typename Unit>
  friend class TokenStreamPosition;

  // Accessors.
  unsigned cursor() const { return cursor_; }
  unsigned nextCursor() const { return (cursor_ + 1) & ntokensMask; }
  unsigned aheadCursor(unsigned steps) const {
    return (cursor_ + steps) & ntokensMask;
  }

  const Token& currentToken() const { return tokens[cursor()]; }
  bool isCurrentTokenType(TokenKind type) const {
    return currentToken().type == type;
  }

  [[nodiscard]] bool checkOptions();

 private:
  TaggedParserAtomIndex reservedWordToPropertyName(TokenKind tt) const;

 public:
  TaggedParserAtomIndex currentName() const {
    if (isCurrentTokenType(TokenKind::Name) ||
        isCurrentTokenType(TokenKind::PrivateName)) {
      return currentToken().name();
    }

    MOZ_ASSERT(TokenKindIsPossibleIdentifierName(currentToken().type));
    return reservedWordToPropertyName(currentToken().type);
  }

  bool currentNameHasEscapes(ParserAtomsTable& parserAtoms) const {
    if (isCurrentTokenType(TokenKind::Name) ||
        isCurrentTokenType(TokenKind::PrivateName)) {
      TokenPos pos = currentToken().pos;
      return (pos.end - pos.begin) != parserAtoms.length(currentToken().name());
    }

    MOZ_ASSERT(TokenKindIsPossibleIdentifierName(currentToken().type));
    return false;
  }

  bool isCurrentTokenAssignment() const {
    return TokenKindIsAssignment(currentToken().type);
  }

  // Flag
  // methods.
  bool isEOF() const { return flags.isEOF; }
  bool hadError() const { return flags.hadError; }

  DeprecatedContent sawDeprecatedContent() const {
    return static_cast<DeprecatedContent>(flags.sawDeprecatedContent);
  }

 private:
  // Workaround GCC 7 sadness.
  void setSawDeprecatedContent(DeprecatedContent content) {
    flags.sawDeprecatedContent = static_cast<uint8_t>(content);
  }

 public:
  void clearSawDeprecatedContent() {
    setSawDeprecatedContent(DeprecatedContent::None);
  }
  void setSawDeprecatedOctalLiteral() {
    setSawDeprecatedContent(DeprecatedContent::OctalLiteral);
  }
  void setSawDeprecatedOctalEscape() {
    setSawDeprecatedContent(DeprecatedContent::OctalEscape);
  }
  void setSawDeprecatedEightOrNineEscape() {
    setSawDeprecatedContent(DeprecatedContent::EightOrNineEscape);
  }

  bool hasInvalidTemplateEscape() const {
    return invalidTemplateEscapeType != InvalidEscapeType::None;
  }
  void clearInvalidTemplateEscape() {
    invalidTemplateEscapeType = InvalidEscapeType::None;
  }

 private:
  // This is private because it should only be called by the tokenizer while
  // tokenizing, not by, for example, BytecodeEmitter.
  bool strictMode() const {
    return strictModeGetter_ && strictModeGetter_->strictMode();
  }

  void setInvalidTemplateEscape(uint32_t offset, InvalidEscapeType type) {
    MOZ_ASSERT(type != InvalidEscapeType::None);
    if (invalidTemplateEscapeType != InvalidEscapeType::None) {
      return;
    }
    invalidTemplateEscapeOffset = offset;
    invalidTemplateEscapeType = type;
  }

 public:
  // Call this immediately after parsing an OrExpression to allow scanning the
  // next token with SlashIsRegExp without asserting (even though we just
  // peeked at it in SlashIsDiv mode).
  //
  // It's OK to disable the assertion because the places where this is called
  // have peeked at the next token in SlashIsDiv mode, and checked that it is
  // *not* a Div token.
  //
  // To see why it is necessary to disable the assertion, consider these two
  // programs:
  //
  //     x = arg => q       // per spec, this is all one statement, and the
  //     /a/g;              // slashes are division operators
  //
  //     x = arg => {}      // per spec, ASI at the end of this line
  //     /a/g;              // and that's a regexp literal
  //
  // The first program shows why orExpr() has to use SlashIsDiv mode when
  // peeking ahead for the next operator after parsing `q`. The second program
  // shows why matchOrInsertSemicolon() must use SlashIsRegExp mode when
  // scanning ahead for a semicolon.
  void allowGettingNextTokenWithSlashIsRegExp() {
#ifdef DEBUG
    // Check the precondition: Caller already peeked ahead at the next token,
    // in SlashIsDiv mode, and it is *not* a Div token.
    MOZ_ASSERT(hasLookahead());
    const Token& next = nextToken();
    MOZ_ASSERT(next.modifier == SlashIsDiv);
    MOZ_ASSERT(next.type != TokenKind::Div);
    tokens[nextCursor()].modifier = SlashIsRegExp;
#endif
  }

#ifdef DEBUG
  inline bool debugHasNoLookahead() const { return lookahead == 0; }
#endif

  bool hasDisplayURL() const { return displayURL_ != nullptr; }

  char16_t* displayURL() { return displayURL_.get(); }

  bool hasSourceMapURL() const { return sourceMapURL_ != nullptr; }

  char16_t* sourceMapURL() { return sourceMapURL_.get(); }

  FrontendContext* context() const { return fc; }

  using LineToken = SourceCoords::LineToken;

  LineToken lineToken(uint32_t offset) const {
    return srcCoords.lineToken(offset);
  }

  uint32_t lineNumber(LineToken lineToken) const {
    return srcCoords.lineNumber(lineToken);
  }

  uint32_t lineStart(LineToken lineToken) const {
    return srcCoords.lineStart(lineToken);
  }

  /**
   * Fill in |err|.
   *
   * If the token stream doesn't have location info for this error, use the
   * caller's location (including line/column number) and return false. (No
   * line of context is set.)
   *
   * Otherwise fill in everything in |err| except 1) line/column numbers and
   * 2) line-of-context-related fields and return true. The caller *must*
   * fill in the line/column number; filling the line of context is optional.
   */
  bool fillExceptingContext(ErrorMetadata* err, uint32_t offset) const;

  MOZ_ALWAYS_INLINE void updateFlagsForEOL() { flags.isDirtyLine = false; }

 private:
  /**
   * Compute the column number offset from the 1st code unit in the line in
   * UTF-16 code units, for given absolute |offset| within source text on the
   * line of |lineToken| (which must have been computed from |offset|).
   *
   * A column number offset on a line that isn't the first line is just
   * the actual column number in 0-origin. But a column number offset
   * on the first line is the column number offset from the initial
   * line/column of the script. For example, consider this HTML with
   * line/column number keys:
   *
   *     Column number in 1-origin
   *              1         2            3
   *     123456789012345678901234   567890
   *
   *     Column number in 0-origin, and the offset from 1st column
   *               1         2            3
   *     0123456789012345678901234   567890
   *     ------------------------------------
   * 1 | <html>
   * 2 | <head>
   * 3 |   <script>var x = 3;  x &lt; 4;
   * 4 | const y = 7;</script>
   * 5 | </head>
   * 6 | <body></body>
   * 7 | </html>
   *
   * The script would be compiled specifying initial (line, column) of (3, 10)
   * using |JS::ReadOnlyCompileOptions::{lineno,column}|, which is 0-origin.
   * And the column reported by |computeColumn| for the "v" of |var| would be
   * 11 (in 1-origin). But the column number offset of the "v" in |var|, that
   * this function returns, would be 0. On the other hand, the column reported
   * by |computeColumn| would be 1 (in 1-origin) and the column number offset
   * returned by this function for the "c" in |const| would be 0, because it's
   * not in the first line of source text.
   *
   * The column number offset is with respect *only* to the JavaScript source
   * text as SpiderMonkey sees it. In the example, the "&lt;" is converted to
   * "<" by the browser before SpiderMonkey would see it. So the column number
   * offset of the "4" in the inequality would be 16, not 19.
   *
   * UTF-16 code units are not all equal length in UTF-8 source, so counting
   * requires *some* kind of linear-time counting from the start of the line.
   * This function attempts various tricks to reduce this cost.
   * If these optimizations succeed, repeated calls to this function on a line
   * will pay a one-time cost linear in the length of the line, then each call
   * pays a separate constant-time cost. If the optimizations do not succeed,
   * this function works in time linear in the length of the line.
   *
   * It's unusual for a function in *this* class to be |Unit|-templated, but
   * while this operation manages |Unit|-agnostic fields in this class and in
   * |srcCoords|, it must *perform* |Unit|-sensitive computations to fill them.
   * And this is the best place to do that.
   */
  template <typename Unit>
  JS::ColumnNumberUnsignedOffset computeColumnOffset(
      const LineToken lineToken, const uint32_t offset,
      const SourceUnits<Unit>& sourceUnits) const;

  template <typename Unit>
  JS::ColumnNumberUnsignedOffset computeColumnOffsetForUTF8(
      const LineToken lineToken, const uint32_t offset, const uint32_t start,
      const uint32_t offsetInLine, const SourceUnits<Unit>& sourceUnits) const;

  /**
   * Update line/column information for the start of a new line at
   * |lineStartOffset|.
   */
  [[nodiscard]] MOZ_ALWAYS_INLINE bool internalUpdateLineInfoForEOL(
      uint32_t lineStartOffset);

 public:
  const Token& nextToken() const {
    MOZ_ASSERT(hasLookahead());
    return tokens[nextCursor()];
  }

  bool hasLookahead() const { return lookahead > 0; }

  void advanceCursor() { cursor_ = (cursor_ + 1) & ntokensMask; }

  void retractCursor() { cursor_ = (cursor_ - 1) & ntokensMask; }

  Token* allocateToken() {
    advanceCursor();

    Token* tp = &tokens[cursor()];
    MOZ_MAKE_MEM_UNDEFINED(tp, sizeof(*tp));

    return tp;
  }

  // Push the last scanned token back into the stream.
  void ungetToken() {
    MOZ_ASSERT(lookahead < maxLookahead);
    lookahead++;
    retractCursor();
  }

 public:
  void adoptState(TokenStreamAnyChars& other) {
    // If |other| has fresh information from directives, overwrite any
    // previously recorded directives. (There is no specification directing
    // that the last-in-source-order directive controls, sadly. We behave this
    // way in the ordinary case, so we ought to do so here too.)
    if (auto& url = other.displayURL_) {
      displayURL_ = std::move(url);
    }
    if (auto& url = other.sourceMapURL_) {
      sourceMapURL_ = std::move(url);
    }
  }

  // Compute error metadata for an error at no offset.
  void computeErrorMetadataNoOffset(ErrorMetadata* err) const;

  // ErrorReporter API Helpers

  // Provide a minimal set of error-reporting APIs, given that we cannot use
  // ErrorReportMixin here. The "report" prefix is added to avoid conflicts
  // with ErrorReportMixin methods in the TokenStream class.
  void reportErrorNoOffset(unsigned errorNumber, ...) const;
  void reportErrorNoOffsetVA(unsigned errorNumber, va_list* args) const;

  const JS::ReadOnlyCompileOptions& options() const { return options_; }

  JS::ConstUTF8CharsZ getFilename() const { return filename_; }
};

constexpr char16_t CodeUnitValue(char16_t unit) { return unit; }

constexpr uint8_t CodeUnitValue(mozilla::Utf8Unit unit) {
  return unit.toUint8();
}

template <typename Unit>
class TokenStreamCharsBase;

template <typename T>
inline bool IsLineTerminator(T) = delete;

inline bool IsLineTerminator(char32_t codePoint) {
  return codePoint == '\n' || codePoint == '\r' ||
         codePoint == unicode::LINE_SEPARATOR ||
         codePoint == unicode::PARA_SEPARATOR;
}

inline bool IsLineTerminator(char16_t unit) {
  // Every LineTerminator fits in char16_t, so this is exact.
  return IsLineTerminator(static_cast<char32_t>(unit));
}

template <typename Unit>
struct SourceUnitTraits;

template <>
struct SourceUnitTraits<char16_t> {
 public:
  static constexpr uint8_t maxUnitsLength = 2;

  static constexpr size_t lengthInUnits(char32_t codePoint) {
    return codePoint < unicode::NonBMPMin ? 1 : 2;
  }
};

template <>
struct SourceUnitTraits<mozilla::Utf8Unit> {
 public:
  static constexpr uint8_t maxUnitsLength = 4;

  static constexpr size_t lengthInUnits(char32_t codePoint) {
    return codePoint < 0x80      ? 1
           : codePoint < 0x800   ? 2
           : codePoint < 0x10000 ? 3
                                 : 4;
  }
};

/**
 * PeekedCodePoint represents the result of peeking ahead in some source text
 * to determine the next validly-encoded code point.
 *
 * If there isn't a valid code point, then |isNone()|.
 *
 * But if there *is* a valid code point, then |!isNone()|, the code point has
 * value |codePoint()| and its length in code units is |lengthInUnits()|.
 *
 * Conceptually, this class is |Maybe<struct { char32_t v; uint8_t len; }>|.
 */
template <typename Unit>
class PeekedCodePoint final {
  char32_t codePoint_ = 0;
  uint8_t lengthInUnits_ = 0;

 private:
  using SourceUnitTraits = frontend::SourceUnitTraits<Unit>;

  PeekedCodePoint() = default;

 public:
  /**
   * Create a peeked code point with the given value and length in code
   * units.
   *
   * While the latter value is computable from the former for both UTF-8 and
   * JS's version of UTF-16, the caller likely computed a length in units in
   * the course of determining the peeked value. Passing both here avoids
   * recomputation and lets us do a consistency-checking assertion.
   */
  PeekedCodePoint(char32_t codePoint, uint8_t lengthInUnits)
      : codePoint_(codePoint), lengthInUnits_(lengthInUnits) {
    MOZ_ASSERT(codePoint <= unicode::NonBMPMax);
    MOZ_ASSERT(lengthInUnits != 0, "bad code point length");
    MOZ_ASSERT(lengthInUnits == SourceUnitTraits::lengthInUnits(codePoint));
  }

  /** Create a PeekedCodePoint that represents no valid code point. */
  static PeekedCodePoint none() { return PeekedCodePoint(); }

  /** True if no code point was found, false otherwise. */
  bool isNone() const { return lengthInUnits_ == 0; }

  /** If a code point was found, its value. */
  char32_t codePoint() const {
    MOZ_ASSERT(!isNone());
    return codePoint_;
  }

  /** If a code point was found, its length in code units. */
  uint8_t lengthInUnits() const {
    MOZ_ASSERT(!isNone());
    return lengthInUnits_;
  }
};

inline PeekedCodePoint<char16_t> PeekCodePoint(const char16_t* const ptr,
                                               const char16_t* const end) {
  if (MOZ_UNLIKELY(ptr >= end)) {
    return PeekedCodePoint<char16_t>::none();
  }

  char16_t lead = ptr[0];

  char32_t c;
  uint8_t len;
  if (MOZ_LIKELY(!unicode::IsLeadSurrogate(lead)) ||
      MOZ_UNLIKELY(ptr + 1 >= end || !unicode::IsTrailSurrogate(ptr[1]))) {
    c = lead;
    len = 1;
  } else {
    c = unicode::UTF16Decode(lead, ptr[1]);
    len = 2;
  }

  return PeekedCodePoint<char16_t>(c, len);
}

inline PeekedCodePoint<mozilla::Utf8Unit> PeekCodePoint(
    const mozilla::Utf8Unit* const ptr, const mozilla::Utf8Unit* const end) {
  if (MOZ_UNLIKELY(ptr >= end)) {
    return PeekedCodePoint<mozilla::Utf8Unit>::none();
  }

  const mozilla::Utf8Unit lead = ptr[0];
  if (mozilla::IsAscii(lead)) {
    return PeekedCodePoint<mozilla::Utf8Unit>(lead.toUint8(), 1);
  }

  const mozilla::Utf8Unit* afterLead = ptr + 1;
  mozilla::Maybe<char32_t>
      codePoint = mozilla::DecodeOneUtf8CodePoint(lead, &afterLead, end);
  if (codePoint.isNothing()) {
    return PeekedCodePoint<mozilla::Utf8Unit>::none();
  }

  auto len =
      mozilla::AssertedCast<uint8_t>(mozilla::PointerRangeSize(ptr, afterLead));
  MOZ_ASSERT(len <= 4);

  return PeekedCodePoint<mozilla::Utf8Unit>(codePoint.value(), len);
}

inline bool IsSingleUnitLineTerminator(mozilla::Utf8Unit unit) {
  // BEWARE: The Unicode line/paragraph separators don't fit in a single
  //         UTF-8 code unit, so this test is exact for Utf8Unit but inexact
  //         for UTF-8 as a whole. Users must handle |unit| as start of a
  //         Unicode LineTerminator themselves!
  return unit == mozilla::Utf8Unit('\n') || unit == mozilla::Utf8Unit('\r');
}

// This is the low-level interface to the JS source code buffer. It just gets
// raw Unicode code units -- 16-bit char16_t units of source text that are not
// (always) full code points, and 8-bit units of UTF-8 source text soon.
// TokenStreams functions are layered on top and do some extra stuff like
// converting all EOL sequences to '\n', tracking the line number, and setting
// |flags.isEOF|. (The "raw" in "raw Unicode code units" refers to the lack of
// EOL sequence normalization.)
//
// buf[0..length-1] often represents a substring of some larger source,
// where we have only the substring in memory. The |startOffset| argument
// indicates the offset within this larger string at which our string
// begins, the offset of |buf[0]|.
template <typename Unit>
class SourceUnits {
 private:
  /** Base of buffer. */
  const Unit* base_;

  /** Offset of base_[0]. */
  uint32_t startOffset_;

  /** Limit for quick bounds check. */
  const Unit* limit_;

  /** Next char to get.
   */
  const Unit* ptr;

 public:
  SourceUnits(const Unit* units, size_t length, size_t startOffset)
      : base_(units),
        startOffset_(startOffset),
        limit_(units + length),
        ptr(units) {}

  bool atStart() const {
    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
    return ptr == base_;
  }

  bool atEnd() const {
    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
    MOZ_ASSERT(ptr <= limit_, "shouldn't have overrun");
    return ptr >= limit_;
  }

  size_t remaining() const {
    MOZ_ASSERT(!isPoisoned(),
               "can't get a count of remaining code units if poisoned");
    return mozilla::PointerRangeSize(ptr, limit_);
  }

  size_t startOffset() const { return startOffset_; }

  size_t offset() const {
    return startOffset_ + mozilla::PointerRangeSize(base_, ptr);
  }

  const Unit* codeUnitPtrAt(size_t offset) const {
    MOZ_ASSERT(!isPoisoned(), "shouldn't be using if poisoned");
    MOZ_ASSERT(startOffset_ <= offset);
    MOZ_ASSERT(offset - startOffset_ <=
               mozilla::PointerRangeSize(base_, limit_));
    return base_ + (offset - startOffset_);
  }

  const Unit* current() const { return ptr; }

  const Unit* limit() const { return limit_; }

  Unit previousCodeUnit() {
    MOZ_ASSERT(!isPoisoned(), "can't get previous code unit if poisoned");
    MOZ_ASSERT(!atStart(), "must have a previous code unit to get");
    return *(ptr - 1);
  }

  MOZ_ALWAYS_INLINE Unit getCodeUnit() {
    return *ptr++;  // this will nullptr-crash if poisoned
  }

  Unit peekCodeUnit() const {
    return *ptr;  // this will nullptr-crash if poisoned
  }

  /**
   * Determine the next code point in source text. The code point is not
   * normalized: '\r', '\n', '\u2028', and '\u2029' are returned literally.
   * If there is no next code point because |atEnd()|, or if an encoding
   * error is encountered, return a |PeekedCodePoint| that |isNone()|.
   *
   * This function does not report errors: code that attempts to get the next
   * code point must report any error.
   *
   * If a next code point is found, it may be consumed by passing it to
   * |consumeKnownCodePoint|.
   */
  PeekedCodePoint<Unit> peekCodePoint() const {
    return PeekCodePoint(ptr, limit_);
  }

 private:
#ifdef DEBUG
  void assertNextCodePoint(const PeekedCodePoint<Unit>& peeked);
#endif

 public:
  /**
   * Consume a peeked code point that |!isNone()|.
   *
   * This call DOES NOT UPDATE LINE-STATUS. You may need to call
   * |updateLineInfoForEOL()| and |updateFlagsForEOL()| if this consumes a
   * LineTerminator. Note that if this consumes '\r', you also must consume
   * an optional '\n' (i.e. a full LineTerminatorSequence) before doing so.
   */
  void consumeKnownCodePoint(const PeekedCodePoint<Unit>& peeked) {
    MOZ_ASSERT(!peeked.isNone());
    MOZ_ASSERT(peeked.lengthInUnits() <= remaining());

#ifdef DEBUG
    assertNextCodePoint(peeked);
#endif

    ptr += peeked.lengthInUnits();
  }

  /** Match |n| hexadecimal digits and store their value in |*out|.
   */
  bool matchHexDigits(uint8_t n, char16_t* out) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't peek into poisoned SourceUnits");
    MOZ_ASSERT(n <= 4, "hexdigit value can't overflow char16_t");
    if (n > remaining()) {
      return false;
    }

    char16_t v = 0;
    for (uint8_t i = 0; i < n; i++) {
      auto unit = CodeUnitValue(ptr[i]);
      if (!mozilla::IsAsciiHexDigit(unit)) {
        return false;
      }

      v = (v << 4) | mozilla::AsciiAlphanumericToNumber(unit);
    }

    *out = v;
    ptr += n;
    return true;
  }

  bool matchCodeUnits(const char* chars, uint8_t length) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't match into poisoned SourceUnits");
    if (length > remaining()) {
      return false;
    }

    const Unit* start = ptr;
    const Unit* end = ptr + length;
    while (ptr < end) {
      if (*ptr++ != Unit(*chars++)) {
        ptr = start;
        return false;
      }
    }

    return true;
  }

  void skipCodeUnits(uint32_t n) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
    MOZ_ASSERT(n <= remaining(), "shouldn't skip beyond end of SourceUnits");
    ptr += n;
  }

  void unskipCodeUnits(uint32_t n) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
    MOZ_ASSERT(n <= mozilla::PointerRangeSize(base_, ptr),
               "shouldn't unskip beyond start of SourceUnits");
    ptr -= n;
  }

 private:
  friend class TokenStreamCharsBase<Unit>;

  bool internalMatchCodeUnit(Unit c) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
    if (MOZ_LIKELY(!atEnd()) && *ptr == c) {
      ptr++;
      return true;
    }
    return false;
  }

 public:
  void consumeKnownCodeUnit(Unit c) {
    MOZ_ASSERT(!isPoisoned(), "shouldn't use poisoned SourceUnits");
    MOZ_ASSERT(*ptr == c, "consuming the wrong code unit");
    ptr++;
  }

  /** Unget U+2028 LINE SEPARATOR or U+2029
   * PARAGRAPH SEPARATOR. */
  inline void ungetLineOrParagraphSeparator();

  void ungetCodeUnit() {
    MOZ_ASSERT(!isPoisoned(), "can't unget from poisoned units");
    MOZ_ASSERT(!atStart(), "can't unget if currently at start");
    ptr--;
  }

  const Unit* addressOfNextCodeUnit(bool allowPoisoned = false) const {
    MOZ_ASSERT_IF(!allowPoisoned, !isPoisoned());
    return ptr;
  }

  // Use this with caution!
  void setAddressOfNextCodeUnit(const Unit* a, bool allowPoisoned = false) {
    MOZ_ASSERT_IF(!allowPoisoned, a);
    ptr = a;
  }

  // Poison the SourceUnits so they can't be accessed again.
  void poisonInDebug() {
#ifdef DEBUG
    ptr = nullptr;
#endif
  }

 private:
  bool isPoisoned() const {
#ifdef DEBUG
    // |ptr| can be null for unpoisoned SourceUnits if this was initialized with
    // |units == nullptr| and |length == 0|. In that case, for lack of any
    // better options, consider this to not be poisoned.
    return ptr == nullptr && ptr != limit_;
#else
    return false;
#endif
  }

 public:
  /**
   * Consume the rest of a single-line comment (but not the EOL/EOF that
   * terminates it).
   *
   * If an encoding error is encountered -- possible only for UTF-8 because
   * JavaScript's conception of UTF-16 encompasses any sequence of 16-bit
   * code units -- valid code points prior to the encoding error are consumed
   * and subsequent invalid code units are not consumed. For example, given
   * these UTF-8 code units:
   *
   *   'B'   'A'   'D'   ':'   <bad code unit sequence>
   *   0x42  0x41  0x44  0x3A  0xD0  0x00 ...
   *
   * the first four code units are consumed, but 0xD0 and 0x00 are not
   * consumed because 0xD0 encodes a two-byte lead unit but 0x00 is not a
   * valid trailing code unit.
   *
   * It is expected that the caller will report such an encoding error when
   * it attempts to consume the next code point.
   */
  void consumeRestOfSingleLineComment();

  /**
   * The maximum radius of code around the location of an error that should
   * be included in a syntax error message -- this many code units to either
   * side. The resulting window of data is then accordingly trimmed so that
   * the window contains only validly-encoded data.
   *
   * Because this number is the same for both UTF-8 and UTF-16, windows in
   * UTF-8 may contain fewer code points than windows in UTF-16. As we only
   * use this for error messages, we don't particularly care.
   */
  static constexpr size_t WindowRadius = ErrorMetadata::lineOfContextRadius;

  /**
   * From absolute offset |offset|, search backward to find an absolute
   * offset within source text, no further than |WindowRadius| code units
   * away from |offset|, such that all code points from that offset to
   * |offset| are valid, non-LineTerminator code points.
   */
  size_t findWindowStart(size_t offset) const;

  /**
   * From absolute offset |offset|, find an absolute offset within source
   * text, no further than |WindowRadius| code units away from |offset|, such
   * that all code units from |offset| to that offset are valid,
   * non-LineTerminator code points.
   */
  size_t findWindowEnd(size_t offset) const;

  /**
   * Given a |window| of |encodingSpecificWindowLength| units encoding valid
   * Unicode text, with index |encodingSpecificTokenOffset| indicating a
   * particular code point boundary in |window|, compute the corresponding
   * token offset and length if |window| were encoded in UTF-16. For
   * example:
   *
   *   // U+03C0 GREEK SMALL LETTER PI is encoded as 0xCF 0x80.
   *   const Utf8Unit* encodedWindow =
   *     reinterpret_cast<const Utf8Unit*>(u8"ππππ = @ FAIL");
   *   size_t encodedTokenOffset = 11; // 2 * 4 + ' = '.length
   *   size_t encodedWindowLength = 17; // 2 * 4 + ' = @ FAIL'.length
   *   size_t utf16Offset, utf16Length;
   *   computeWindowOffsetAndLength(encodedWindow,
   *                                encodedTokenOffset, &utf16Offset,
   *                                encodedWindowLength, &utf16Length);
   *   MOZ_ASSERT(utf16Offset == 7);
   *   MOZ_ASSERT(utf16Length == 13);
   *
   * This function asserts if called for UTF-16: the sole caller can avoid
   * computing UTF-16 offsets when they're definitely the same as the encoded
   * offsets.
   */
  inline void computeWindowOffsetAndLength(const Unit* encodeWindow,
                                           size_t encodingSpecificTokenOffset,
                                           size_t* utf16TokenOffset,
                                           size_t encodingSpecificWindowLength,
                                           size_t* utf16WindowLength) const;
};

template <>
inline void SourceUnits<char16_t>::ungetLineOrParagraphSeparator() {
#ifdef DEBUG
  char16_t prev = previousCodeUnit();
#endif
  MOZ_ASSERT(prev == unicode::LINE_SEPARATOR ||
             prev == unicode::PARA_SEPARATOR);

  ungetCodeUnit();
}

template <>
inline void SourceUnits<mozilla::Utf8Unit>::ungetLineOrParagraphSeparator() {
  unskipCodeUnits(3);

  MOZ_ASSERT(ptr[0].toUint8() == 0xE2);
  MOZ_ASSERT(ptr[1].toUint8() == 0x80);

#ifdef DEBUG
  uint8_t last = ptr[2].toUint8();
#endif
  MOZ_ASSERT(last == 0xA8 || last == 0xA9);
}

/**
 * An all-purpose buffer type for accumulating text during tokenizing.
 *
 * In principle we could make this buffer contain |char16_t|, |Utf8Unit|, or
 * |Unit|.
 * We use |char16_t| because:
 *
 *   * we don't have a UTF-8 regular expression parser, so in general regular
 *     expression text must be copied to a separate UTF-16 buffer to parse it,
 *     and
 *   * |TokenStreamCharsShared::copyCharBufferTo|, which copies a shared
 *     |CharBuffer| to a |char16_t*|, is simpler if it doesn't have to convert.
 */
using CharBuffer = Vector<char16_t, 32>;

/**
 * Append the provided code point (in the range [U+0000, U+10FFFF], surrogate
 * code points included) to the buffer.
 */
[[nodiscard]] extern bool AppendCodePointToCharBuffer(CharBuffer& charBuffer,
                                                      char32_t codePoint);

/**
 * Accumulate the range of UTF-16 text (lone surrogates permitted, because JS
 * allows them in source text) into |charBuffer|. Normalize '\r', '\n', and
 * "\r\n" into '\n'.
 */
[[nodiscard]] extern bool FillCharBufferFromSourceNormalizingAsciiLineBreaks(
    CharBuffer& charBuffer, const char16_t* cur, const char16_t* end);

/**
 * Accumulate the range of previously-validated UTF-8 text into |charBuffer|.
 * Normalize '\r', '\n', and "\r\n" into '\n'.
 */
[[nodiscard]] extern bool FillCharBufferFromSourceNormalizingAsciiLineBreaks(
    CharBuffer& charBuffer, const mozilla::Utf8Unit* cur,
    const mozilla::Utf8Unit* end);

class TokenStreamCharsShared {
 protected:
  FrontendContext* fc;

  /**
   * Buffer transiently used to store sequences of identifier or string code
   * points when such can't be directly processed from the original source
   * text (e.g. because it contains escapes).
   */
  CharBuffer charBuffer;

  /** Information for parsing with a lifetime longer than the parser itself.
   */
  ParserAtomsTable* parserAtoms;

 protected:
  explicit TokenStreamCharsShared(FrontendContext* fc,
                                  ParserAtomsTable* parserAtoms)
      : fc(fc), charBuffer(fc), parserAtoms(parserAtoms) {}

  [[nodiscard]] bool copyCharBufferTo(
      UniquePtr<char16_t[], JS::FreePolicy>* destination);

  /**
   * Determine whether a code unit constitutes a complete ASCII code point.
   * (The code point's exact value might not be used, however, if subsequent
   * code observes that |unit| is part of a LineTerminatorSequence.)
   */
  [[nodiscard]] static constexpr MOZ_ALWAYS_INLINE bool isAsciiCodePoint(
      int32_t unit) {
    return mozilla::IsAscii(static_cast<char32_t>(unit));
  }

  TaggedParserAtomIndex drainCharBufferIntoAtom() {
    // Add to parser atoms table.
    auto atom = this->parserAtoms->internChar16(fc, charBuffer.begin(),
                                                charBuffer.length());
    charBuffer.clear();
    return atom;
  }

 protected:
  void adoptState(TokenStreamCharsShared& other) {
    // The other stream's buffer may contain information for a
    // gotten-then-ungotten token, that we must transfer into this stream so
    // that token's final get behaves as desired.
    charBuffer = std::move(other.charBuffer);
  }

 public:
  CharBuffer& getCharBuffer() { return charBuffer; }
};

template <typename Unit>
class TokenStreamCharsBase : public TokenStreamCharsShared {
 protected:
  using SourceUnits = frontend::SourceUnits<Unit>;

  /** Code units in the source code being tokenized. */
  SourceUnits sourceUnits;

  // End of fields.

 protected:
  TokenStreamCharsBase(FrontendContext* fc, ParserAtomsTable* parserAtoms,
                       const Unit* units, size_t length, size_t startOffset);

  /**
   * Convert a non-EOF code unit returned by |getCodeUnit()| or
   * |peekCodeUnit()| to a Unit code unit.
   */
  inline Unit toUnit(int32_t codeUnitValue);

  void ungetCodeUnit(int32_t c) {
    if (c == EOF) {
      MOZ_ASSERT(sourceUnits.atEnd());
      return;
    }

    MOZ_ASSERT(sourceUnits.previousCodeUnit() == toUnit(c));
    sourceUnits.ungetCodeUnit();
  }

  MOZ_ALWAYS_INLINE TaggedParserAtomIndex
  atomizeSourceChars(mozilla::Span<const Unit> units);

  /**
   * Try to match a non-LineTerminator ASCII code point. Return true iff it
   * was matched.
   */
  bool matchCodeUnit(char expect) {
    MOZ_ASSERT(mozilla::IsAscii(expect));
    MOZ_ASSERT(expect != '\r');
    MOZ_ASSERT(expect != '\n');
    return this->sourceUnits.internalMatchCodeUnit(Unit(expect));
  }

  /**
   * Try to match an ASCII LineTerminator code point. Return true iff it was
   * matched.
   */
  MOZ_NEVER_INLINE bool matchLineTerminator(char expect) {
    MOZ_ASSERT(expect == '\r' || expect == '\n');
    return this->sourceUnits.internalMatchCodeUnit(Unit(expect));
  }

  template <typename T>
  bool matchCodeUnit(T) = delete;
  template <typename T>
  bool matchLineTerminator(T) = delete;

  int32_t peekCodeUnit() {
    return MOZ_LIKELY(!sourceUnits.atEnd())
               ? CodeUnitValue(sourceUnits.peekCodeUnit())
               : EOF;
  }

  /** Consume a known, non-EOF code unit. */
  inline void consumeKnownCodeUnit(int32_t unit);

  // Forbid accidental calls to consumeKnownCodeUnit *not* with the single
  // unit-or-EOF type. Unit should use SourceUnits::consumeKnownCodeUnit;
  // CodeUnitValue() results should go through toUnit(), or better yet just
  // use the original Unit.
  template <typename T>
  inline void consumeKnownCodeUnit(T) = delete;

  /**
   * Add a null-terminated line of context to error information, for the line
   * in |sourceUnits| that contains |offset|.
   * Also record the window's length and the offset of the error in the
   * window. (Don't bother adding a line of context if it would be empty.)
   *
   * The window will contain no LineTerminators of any kind, and it will not
   * extend more than |SourceUnits::WindowRadius| to either side of |offset|,
   * nor into the previous or next lines.
   *
   * This function is quite internal, and you probably should be calling one
   * of its existing callers instead.
   */
  [[nodiscard]] bool addLineOfContext(ErrorMetadata* err,
                                      uint32_t offset) const;
};

template <>
inline char16_t TokenStreamCharsBase<char16_t>::toUnit(int32_t codeUnitValue) {
  MOZ_ASSERT(codeUnitValue != EOF, "EOF is not a Unit");
  return mozilla::AssertedCast<char16_t>(codeUnitValue);
}

template <>
inline mozilla::Utf8Unit TokenStreamCharsBase<mozilla::Utf8Unit>::toUnit(
    int32_t value) {
  MOZ_ASSERT(value != EOF, "EOF is not a Unit");
  return mozilla::Utf8Unit(mozilla::AssertedCast<unsigned char>(value));
}

template <typename Unit>
inline void TokenStreamCharsBase<Unit>::consumeKnownCodeUnit(int32_t unit) {
  sourceUnits.consumeKnownCodeUnit(toUnit(unit));
}

template <>
MOZ_ALWAYS_INLINE TaggedParserAtomIndex
TokenStreamCharsBase<char16_t>::atomizeSourceChars(
    mozilla::Span<const char16_t> units) {
  return this->parserAtoms->internChar16(fc, units.data(), units.size());
}

template <>
/* static */ MOZ_ALWAYS_INLINE TaggedParserAtomIndex
TokenStreamCharsBase<mozilla::Utf8Unit>::atomizeSourceChars(
    mozilla::Span<const mozilla::Utf8Unit> units) {
  return this->parserAtoms->internUtf8(fc, units.data(), units.size());
}

template <typename Unit>
class SpecializedTokenStreamCharsBase;

template <>
class SpecializedTokenStreamCharsBase<char16_t>
    : public TokenStreamCharsBase<char16_t> {
  using CharsBase = TokenStreamCharsBase<char16_t>;

 protected:
  using TokenStreamCharsShared::isAsciiCodePoint;
  // Deliberately don't |using| |sourceUnits| because of bug 1472569. :-(

  using typename CharsBase::SourceUnits;

 protected:
  // These APIs are only usable by UTF-16-specific code.

  /**
   * Given |lead| already consumed, consume and return the code point encoded
   * starting from it. Infallible because lone surrogates in JS encode a
   * "code point" of the same value.
   */
  char32_t infallibleGetNonAsciiCodePointDontNormalize(char16_t lead) {
    MOZ_ASSERT(!isAsciiCodePoint(lead));
    MOZ_ASSERT(this->sourceUnits.previousCodeUnit() == lead);

    // Handle single-unit code points and lone trailing surrogates.
    if (MOZ_LIKELY(!unicode::IsLeadSurrogate(lead)) ||
        // Or handle lead surrogates not paired with trailing surrogates.
        MOZ_UNLIKELY(
            this->sourceUnits.atEnd() ||
            !unicode::IsTrailSurrogate(this->sourceUnits.peekCodeUnit()))) {
      return lead;
    }

    // Otherwise it's a multi-unit code point.
    return unicode::UTF16Decode(lead, this->sourceUnits.getCodeUnit());
  }

 protected:
  // These APIs are in both SpecializedTokenStreamCharsBase specializations
  // and so are usable in subclasses no matter what Unit is.

  using CharsBase::CharsBase;
};

template <>
class SpecializedTokenStreamCharsBase<mozilla::Utf8Unit>
    : public TokenStreamCharsBase<mozilla::Utf8Unit> {
  using CharsBase = TokenStreamCharsBase<mozilla::Utf8Unit>;

 protected:
  // Deliberately don't |using| |sourceUnits| because of bug 1472569. :-(

 protected:
  // These APIs are only usable by UTF-8-specific code.

  using typename CharsBase::SourceUnits;

  /**
   * A mutable iterator-wrapper around |SourceUnits| that translates
   * operators to calls to |SourceUnits::getCodeUnit()| and similar.
   *
   * This class is expected to be used in concert with |SourceUnitsEnd|.
   */
  class SourceUnitsIterator {
    SourceUnits& sourceUnits_;
#ifdef DEBUG
    // In iterator copies created by the post-increment operator, a pointer
    // at the next source text code unit when the post-increment operator
    // was called, cleared when the iterator is dereferenced.
    mutable mozilla::Maybe<const mozilla::Utf8Unit*>
        currentBeforePostIncrement_;
#endif

   public:
    explicit SourceUnitsIterator(SourceUnits& sourceUnits)
        : sourceUnits_(sourceUnits) {}

    mozilla::Utf8Unit operator*() const {
      // operator* is expected to get the *next* value from an iterator
      // not pointing at the end of the underlying range. However, the
      // sole use of this is in the context of an expression of the form
      // |*iter++|, that performed the |sourceUnits_.getCodeUnit()| in
      // the |operator++(int)| below -- so dereferencing acts on a
      // |sourceUnits_| already advanced. Therefore the correct unit to
      // return is the previous one.
      MOZ_ASSERT(currentBeforePostIncrement_.value() + 1 ==
                 sourceUnits_.current());
#ifdef DEBUG
      currentBeforePostIncrement_.reset();
#endif
      return sourceUnits_.previousCodeUnit();
    }

    SourceUnitsIterator operator++(int) {
      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
                 "the only valid operation on a post-incremented "
                 "iterator is dereferencing a single time");

      SourceUnitsIterator copy = *this;
#ifdef DEBUG
      copy.currentBeforePostIncrement_.emplace(sourceUnits_.current());
#endif

      sourceUnits_.getCodeUnit();
      return copy;
    }

    void operator-=(size_t n) {
      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
                 "the only valid operation on a post-incremented "
                 "iterator is dereferencing a single time");
      sourceUnits_.unskipCodeUnits(n);
    }

    mozilla::Utf8Unit operator[](ptrdiff_t index) {
      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
                 "the only valid operation on a post-incremented "
                 "iterator is dereferencing a single time");
      MOZ_ASSERT(index == -1,
                 "must only be called to verify the value of the "
                 "previous code unit");
      return sourceUnits_.previousCodeUnit();
    }

    size_t remaining() const {
      MOZ_ASSERT(currentBeforePostIncrement_.isNothing(),
                 "the only valid operation on a post-incremented "
                 "iterator is dereferencing a single time");
      return sourceUnits_.remaining();
    }
  };

  /** A sentinel representing the end of |SourceUnits| data. */
  class SourceUnitsEnd {};

  friend inline size_t operator-(const SourceUnitsEnd& aEnd,
                                 const SourceUnitsIterator& aIter);

 protected:
  // These APIs are in both SpecializedTokenStreamCharsBase specializations
  // and so are usable in subclasses no matter what Unit is.

  using CharsBase::CharsBase;
};

inline size_t operator-(const SpecializedTokenStreamCharsBase<
                            mozilla::Utf8Unit>::SourceUnitsEnd& aEnd,
                        const SpecializedTokenStreamCharsBase<
                            mozilla::Utf8Unit>::SourceUnitsIterator& aIter) {
  return aIter.remaining();
}

/** A small class encapsulating computation of the start-offset of a Token. */
class TokenStart {
  uint32_t startOffset_;

 public:
  /**
   * Compute a starting offset that is the current offset of |sourceUnits|,
   * offset by |adjust|. (For example, |adjust| of -1 indicates the code
   * unit one backwards from |sourceUnits|'s current offset.)
   */
  template <class SourceUnits>
  TokenStart(const SourceUnits& sourceUnits, ptrdiff_t adjust)
      : startOffset_(sourceUnits.offset() + adjust) {}

  TokenStart(const TokenStart&) = default;

  uint32_t offset() const { return startOffset_; }
};

template <typename Unit, class AnyCharsAccess>
class GeneralTokenStreamChars : public SpecializedTokenStreamCharsBase<Unit> {
  using CharsBase = TokenStreamCharsBase<Unit>;
  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<Unit>;

  using LineToken = TokenStreamAnyChars::LineToken;

 private:
  Token* newTokenInternal(TokenKind kind, TokenStart start, TokenKind* out);

  /**
   * Allocates a new Token from the given offset to the current offset,
   * ascribes it the given kind, and sets |*out| to that kind.
   */
  Token* newToken(TokenKind kind, TokenStart start,
                  TokenStreamShared::Modifier modifier, TokenKind* out) {
    Token* token = newTokenInternal(kind, start, out);

#ifdef DEBUG
    // Save the modifier used to get this token, so that if an ungetToken()
    // occurs and then the token is re-gotten (or peeked, etc.), we can
    // assert both gets used compatible modifiers.
    token->modifier = modifier;
#endif

    return token;
  }

  uint32_t matchUnicodeEscape(char32_t* codePoint);
  uint32_t matchExtendedUnicodeEscape(char32_t* codePoint);

 protected:
  using CharsBase::addLineOfContext;
  using CharsBase::matchCodeUnit;
  using CharsBase::matchLineTerminator;
  using TokenStreamCharsShared::drainCharBufferIntoAtom;
  using TokenStreamCharsShared::isAsciiCodePoint;
  // Deliberately don't |using CharsBase::sourceUnits| because of bug 1472569.
  // :-(
  using CharsBase::toUnit;

  using typename CharsBase::SourceUnits;

 protected:
  using SpecializedCharsBase::SpecializedCharsBase;

  TokenStreamAnyChars& anyCharsAccess() {
    return AnyCharsAccess::anyChars(this);
  }

  const TokenStreamAnyChars& anyCharsAccess() const {
    return AnyCharsAccess::anyChars(this);
  }

  using TokenStreamSpecific =
      frontend::TokenStreamSpecific<Unit, AnyCharsAccess>;

  TokenStreamSpecific* asSpecific() {
    static_assert(
        std::is_base_of_v<GeneralTokenStreamChars, TokenStreamSpecific>,
        "static_cast below presumes an inheritance relationship");

    return static_cast<TokenStreamSpecific*>(this);
  }

 protected:
  /**
   * Compute the column number in Unicode code points of the absolute |offset|
   * within source text on the line corresponding to |lineToken|.
   *
   * |offset| must be a code point boundary, preceded only by validly-encoded
   * source units. (It doesn't have to be *followed* by valid source units.)
   */
  JS::LimitedColumnNumberOneOrigin computeColumn(LineToken lineToken,
                                                 uint32_t offset) const;
  void computeLineAndColumn(uint32_t offset, uint32_t* line,
                            JS::LimitedColumnNumberOneOrigin* column) const;

  /**
   * Fill in |err| completely, except for line-of-context information.
   *
   * Return true if the caller can compute a line of context from the token
   * stream. Otherwise return false.
   */
  [[nodiscard]] bool fillExceptingContext(ErrorMetadata* err,
                                          uint32_t offset) const {
    if (anyCharsAccess().fillExceptingContext(err, offset)) {
      JS::LimitedColumnNumberOneOrigin columnNumber;
      computeLineAndColumn(offset, &err->lineNumber, &columnNumber);
      err->columnNumber = JS::ColumnNumberOneOrigin(columnNumber);
      return true;
    }
    return false;
  }

  void newSimpleToken(TokenKind kind, TokenStart start,
                      TokenStreamShared::Modifier modifier, TokenKind* out) {
    newToken(kind, start, modifier, out);
  }

  void newNumberToken(double dval, DecimalPoint decimalPoint, TokenStart start,
                      TokenStreamShared::Modifier modifier, TokenKind* out) {
    Token* token = newToken(TokenKind::Number, start, modifier, out);
    token->setNumber(dval, decimalPoint);
  }

  void newBigIntToken(TokenStart start, TokenStreamShared::Modifier modifier,
                      TokenKind* out) {
    newToken(TokenKind::BigInt, start, modifier, out);
  }

  void newAtomToken(TokenKind kind, TaggedParserAtomIndex atom,
                    TokenStart start, TokenStreamShared::Modifier modifier,
                    TokenKind* out) {
    MOZ_ASSERT(kind == TokenKind::String || kind == TokenKind::TemplateHead ||
               kind == TokenKind::NoSubsTemplate);

    Token* token = newToken(kind, start, modifier, out);
    token->setAtom(atom);
  }

  void newNameToken(TaggedParserAtomIndex name, TokenStart start,
                    TokenStreamShared::Modifier modifier, TokenKind* out) {
    Token* token = newToken(TokenKind::Name, start, modifier, out);
    token->setName(name);
  }

  void newPrivateNameToken(TaggedParserAtomIndex name, TokenStart start,
                           TokenStreamShared::Modifier modifier,
                           TokenKind* out) {
    Token* token = newToken(TokenKind::PrivateName, start, modifier, out);
    token->setName(name);
  }

  void newRegExpToken(JS::RegExpFlags reflags, TokenStart start,
                      TokenKind* out) {
    Token* token = newToken(TokenKind::RegExp, start,
                            TokenStreamShared::SlashIsRegExp, out);
    token->setRegExpFlags(reflags);
  }

  MOZ_COLD bool badToken();

  /**
   * Get the next code unit -- the next numeric sub-unit of source text,
   * possibly smaller than a full code point -- without updating line/column
   * counters or consuming LineTerminatorSequences.
   *
   * Because of these limitations, only use this if (a) the resulting code
   * unit is guaranteed to be ungotten (by ungetCodeUnit()) if it's an EOL,
   * and (b) the line-related state (lineno, linebase) is not used before
   * it's ungotten.
   */
  int32_t getCodeUnit() {
    if (MOZ_LIKELY(!this->sourceUnits.atEnd())) {
      return CodeUnitValue(this->sourceUnits.getCodeUnit());
    }

    anyCharsAccess().flags.isEOF = true;
    return EOF;
  }

  void ungetCodeUnit(int32_t c) {
    MOZ_ASSERT_IF(c == EOF, anyCharsAccess().flags.isEOF);

    CharsBase::ungetCodeUnit(c);
  }

  /**
   * Given a just-consumed ASCII code unit/point |lead|, consume a full code
   * point or LineTerminatorSequence (normalizing it to '\n'). Return true on
   * success, otherwise return false.
   *
   * If a LineTerminatorSequence was consumed, also update line/column info.
   *
   * This may change the current |sourceUnits| offset.
   */
  [[nodiscard]] MOZ_ALWAYS_INLINE bool getFullAsciiCodePoint(int32_t lead) {
    MOZ_ASSERT(isAsciiCodePoint(lead),
               "non-ASCII code units must be handled separately");
    MOZ_ASSERT(toUnit(lead) == this->sourceUnits.previousCodeUnit(),
               "getFullAsciiCodePoint called incorrectly");

    if (MOZ_UNLIKELY(lead == '\r')) {
      matchLineTerminator('\n');
    } else if (MOZ_LIKELY(lead != '\n')) {
      return true;
    }
    return updateLineInfoForEOL();
  }

  [[nodiscard]] MOZ_NEVER_INLINE bool updateLineInfoForEOL() {
    return anyCharsAccess().internalUpdateLineInfoForEOL(
        this->sourceUnits.offset());
  }

  uint32_t matchUnicodeEscapeIdStart(char32_t* codePoint);
  bool matchUnicodeEscapeIdent(char32_t* codePoint);
  bool matchIdentifierStart();

  /**
   * If possible, compute a line of context for an otherwise-filled-in |err|
   * at the given offset in this token stream.
   *
   * This function is very-internal: almost certainly you should use one of
   * its callers instead. It basically exists only to make those callers
   * more readable.
   */
  [[nodiscard]] bool internalComputeLineOfContext(ErrorMetadata* err,
                                                  uint32_t offset) const {
    // We only have line-start information for the current line. If the error
    // is on a different line, we can't easily provide context. (This means
    // any error in a multi-line token, e.g. an unterminated multiline string
    // literal, won't have context.)
    if (err->lineNumber != anyCharsAccess().lineno) {
      return true;
    }

    return addLineOfContext(err, offset);
  }

 public:
  /**
   * Consume any hashbang comment at the start of a Script or Module, if one is
   * present. Stops consuming just before any terminating LineTerminator or
   * before an encoding error is encountered.
   */
  void consumeOptionalHashbangComment();

  TaggedParserAtomIndex getRawTemplateStringAtom() {
    TokenStreamAnyChars& anyChars = anyCharsAccess();

    MOZ_ASSERT(anyChars.currentToken().type == TokenKind::TemplateHead ||
               anyChars.currentToken().type == TokenKind::NoSubsTemplate);
    const Unit* cur =
        this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.begin + 1);
    const Unit* end;
    if (anyChars.currentToken().type == TokenKind::TemplateHead) {
      // Of the form |`...${| or |}...${|
      end =
          this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.end - 2);
    } else {
      // NoSubsTemplate is of the form |`...`| or |}...`|
      end =
          this->sourceUnits.codeUnitPtrAt(anyChars.currentToken().pos.end - 1);
    }

    // |charBuffer| should be empty here, but we may as well code defensively.
    MOZ_ASSERT(this->charBuffer.length() == 0);
    this->charBuffer.clear();

    // Template literals normalize only '\r' and "\r\n" to '\n'; Unicode
    // separators don't need special handling.
    // https://tc39.github.io/ecma262/#sec-static-semantics-tv-and-trv
    if (!FillCharBufferFromSourceNormalizingAsciiLineBreaks(this->charBuffer,
                                                            cur, end)) {
      return TaggedParserAtomIndex::null();
    }

    return drainCharBufferIntoAtom();
  }
};

template <typename Unit, class AnyCharsAccess>
class TokenStreamChars;

template <class AnyCharsAccess>
class TokenStreamChars<char16_t, AnyCharsAccess>
    : public GeneralTokenStreamChars<char16_t, AnyCharsAccess> {
  using CharsBase = TokenStreamCharsBase<char16_t>;
  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<char16_t>;
  using GeneralCharsBase = GeneralTokenStreamChars<char16_t, AnyCharsAccess>;
  using Self = TokenStreamChars<char16_t, AnyCharsAccess>;

  using GeneralCharsBase::asSpecific;

  using typename GeneralCharsBase::TokenStreamSpecific;

 protected:
  using CharsBase::matchLineTerminator;
  using GeneralCharsBase::anyCharsAccess;
  using GeneralCharsBase::getCodeUnit;
  using SpecializedCharsBase::infallibleGetNonAsciiCodePointDontNormalize;
  using TokenStreamCharsShared::isAsciiCodePoint;
  // Deliberately don't |using| |sourceUnits| because of bug 1472569. :-(
  using GeneralCharsBase::ungetCodeUnit;
  using GeneralCharsBase::updateLineInfoForEOL;

 protected:
  using GeneralCharsBase::GeneralCharsBase;

  /**
   * Given the non-ASCII |lead| code unit just consumed, consume and return a
   * complete non-ASCII code point. Line/column updates are not performed,
   * and line breaks are returned as-is without normalization.
   */
  [[nodiscard]] bool getNonAsciiCodePointDontNormalize(char16_t lead,
                                                       char32_t* codePoint) {
    // There are no encoding errors in 16-bit JS, so implement this so that
    // the compiler knows it, too.
    *codePoint = infallibleGetNonAsciiCodePointDontNormalize(lead);
    return true;
  }

  /**
   * Given a just-consumed non-ASCII code unit |lead| (which may also be a
   * full code point, for UTF-16), consume a full code point or
   * LineTerminatorSequence (normalizing it to '\n') and store it in
   * |*codePoint|. Return true on success, otherwise return false and leave
   * |*codePoint| undefined on failure.
   *
   * If a LineTerminatorSequence was consumed, also update line/column info.
   *
   * This may change the current |sourceUnits| offset.
   */
  [[nodiscard]] bool getNonAsciiCodePoint(int32_t lead, char32_t* codePoint);
};

template <class AnyCharsAccess>
class TokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>
    : public GeneralTokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess> {
  using CharsBase = TokenStreamCharsBase<mozilla::Utf8Unit>;
  using SpecializedCharsBase =
      SpecializedTokenStreamCharsBase<mozilla::Utf8Unit>;
  using GeneralCharsBase =
      GeneralTokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>;
  using Self = TokenStreamChars<mozilla::Utf8Unit, AnyCharsAccess>;

  using typename SpecializedCharsBase::SourceUnitsEnd;
  using typename SpecializedCharsBase::SourceUnitsIterator;

 protected:
  using GeneralCharsBase::anyCharsAccess;
  using GeneralCharsBase::computeLineAndColumn;
  using GeneralCharsBase::fillExceptingContext;
  using GeneralCharsBase::internalComputeLineOfContext;
  using TokenStreamCharsShared::isAsciiCodePoint;
  // Deliberately don't |using| |sourceUnits| because of bug 1472569.
  // :-(
  using GeneralCharsBase::updateLineInfoForEOL;

 private:
  static char toHexChar(uint8_t nibble) {
    MOZ_ASSERT(nibble < 16);
    return "0123456789ABCDEF"[nibble];
  }

  static void byteToString(uint8_t n, char* str) {
    str[0] = '0';
    str[1] = 'x';
    str[2] = toHexChar(n >> 4);
    str[3] = toHexChar(n & 0xF);
  }

  static void byteToTerminatedString(uint8_t n, char* str) {
    byteToString(n, str);
    str[4] = '\0';
  }

  /**
   * Report a UTF-8 encoding-related error for a code point starting AT THE
   * CURRENT OFFSET.
   *
   * |relevantUnits| indicates how many code units from the current offset
   * are potentially relevant to the reported error, such that they may be
   * included in the error message. For example, if at the current offset we
   * have
   *
   *   0b1111'1111 ...
   *
   * a code unit never allowed in UTF-8, then |relevantUnits| might be 1
   * because only that unit is relevant. Or if we have
   *
   *   0b1111'0111 0b1011'0101 0b0000'0000 ...
   *
   * where the first two code units are a valid prefix to a four-unit code
   * point but the third unit *isn't* a valid trailing code unit, then
   * |relevantUnits| might be 3.
   */
  MOZ_COLD void internalEncodingError(uint8_t relevantUnits,
                                      unsigned errorNumber, ...);

  // Don't use |internalEncodingError|! Use one of the elaborated functions
  // that calls it, below -- all of which should be used to indicate an error
  // in a code point starting AT THE CURRENT OFFSET as with
  // |internalEncodingError|.

  /** Report an error for an invalid lead code unit |lead|.
   */
  MOZ_COLD void badLeadUnit(mozilla::Utf8Unit lead);

  /**
   * Report an error when there aren't enough code units remaining to
   * constitute a full code point after |lead|: only |remaining| code units
   * were available for a code point starting with |lead|, when at least
   * |required| code units were required.
   */
  MOZ_COLD void notEnoughUnits(mozilla::Utf8Unit lead, uint8_t remaining,
                               uint8_t required);

  /**
   * Report an error for a bad trailing UTF-8 code unit, where the bad
   * trailing unit was the last of |unitsObserved| units examined from the
   * current offset.
   */
  MOZ_COLD void badTrailingUnit(uint8_t unitsObserved);

  // Helper used for both |badCodePoint| and |notShortestForm| for code units
  // that have all the requisite high bits set/unset in a manner that *could*
  // encode a valid code point, but the remaining bits encoding its actual
  // value do not define a permitted value.
  MOZ_COLD void badStructurallyValidCodePoint(char32_t codePoint,
                                              uint8_t codePointLength,
                                              const char* reason);

  /**
   * Report an error for UTF-8 that encodes a UTF-16 surrogate or a number
   * outside the Unicode range.
   */
  MOZ_COLD void badCodePoint(char32_t codePoint, uint8_t codePointLength) {
    MOZ_ASSERT(unicode::IsSurrogate(codePoint) ||
               codePoint > unicode::NonBMPMax);

    badStructurallyValidCodePoint(codePoint, codePointLength,
                                  unicode::IsSurrogate(codePoint)
                                      ? "it's a UTF-16 surrogate"
                                      : "the maximum code point is U+10FFFF");
  }

  /**
   * Report an error for UTF-8 that encodes a code point not in its shortest
   * form.
   */
  MOZ_COLD void notShortestForm(char32_t codePoint, uint8_t codePointLength) {
    MOZ_ASSERT(!unicode::IsSurrogate(codePoint));
    MOZ_ASSERT(codePoint <= unicode::NonBMPMax);

    badStructurallyValidCodePoint(
        codePoint, codePointLength,
        "it wasn't encoded in shortest possible form");
  }

 protected:
  using GeneralCharsBase::GeneralCharsBase;

  /**
   * Given the non-ASCII |lead| code unit just consumed, consume the rest of
   * a non-ASCII code point. The code point is not normalized: on success
   * |*codePoint| may be U+2028 LINE SEPARATOR or U+2029 PARAGRAPH SEPARATOR.
   *
   * Report an error if an invalid code point is encountered.
   */
  [[nodiscard]] bool getNonAsciiCodePointDontNormalize(mozilla::Utf8Unit lead,
                                                       char32_t* codePoint);

  /**
   * Given a just-consumed non-ASCII code unit |lead|, consume a full code
   * point or LineTerminatorSequence (normalizing it to '\n') and store it in
   * |*codePoint|. Return true on success, otherwise return false and leave
   * |*codePoint| undefined on failure.
   *
   * If a LineTerminatorSequence was consumed, also update line/column info.
   *
   * This function will change the current |sourceUnits| offset.
   */
  [[nodiscard]] bool getNonAsciiCodePoint(int32_t lead, char32_t* codePoint);
};

// TokenStream is the lexical scanner for JavaScript source text.
//
// It takes a buffer of Unit code units (char16_t encoding UTF-16, or
// mozilla::Utf8Unit encoding UTF-8) and linearly scans it into |Token|s.
//
// Internally the class uses a four element circular buffer |tokens| of
// |Token|s. As an index for |tokens|, the member |cursor_| points to the
// current token. Calls to getToken() increase |cursor_| by one and return the
// new current token.
// If a TokenStream was just created, the current token is
// uninitialized. It's therefore important that one of the first four member
// functions listed below is called first. The circular buffer lets us go back
// up to two tokens from the last scanned token. Internally, the relative
// number of backward steps that were taken (via ungetToken()) after the last
// token was scanned is stored in |lookahead|.
//
// The following table lists in which situations it is safe to call each listed
// function. No checks are made by the functions in non-debug builds.
//
// Function Name     | Precondition; changes to |lookahead|
// ------------------+---------------------------------------------------------
// getToken          | none; if |lookahead > 0| then |lookahead--|
// peekToken         | none; if |lookahead == 0| then |lookahead == 1|
// peekTokenSameLine | none; if |lookahead == 0| then |lookahead == 1|
// matchToken        | none; if |lookahead > 0| and the match succeeds then
//                   |       |lookahead--|
// consumeKnownToken | none; if |lookahead > 0| then |lookahead--|
// ungetToken        | 0 <= |lookahead| <= |maxLookahead - 1|; |lookahead++|
//
// The behavior of the token scanning process (see getTokenInternal()) can be
// modified by calling one of the first four above listed member functions with
// an optional argument of type Modifier. However, the modifier will be
// ignored unless |lookahead == 0| holds. Due to constraints of the grammar,
// this turns out not to be a problem in practice. See the
// mozilla.dev.tech.js-engine.internals thread entitled 'Bug in the scanner?'
// for more details:
// https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.tech.js-engine.internals/2JLH5jRcr7E
//
// The method seek() allows rescanning from a previously visited location of
// the buffer, initially computed by constructing a Position local variable.
//
template <typename Unit, class AnyCharsAccess>
class MOZ_STACK_CLASS TokenStreamSpecific
    : public TokenStreamChars<Unit, AnyCharsAccess>,
      public TokenStreamShared,
      public ErrorReporter {
 public:
  using CharsBase = TokenStreamCharsBase<Unit>;
  using SpecializedCharsBase = SpecializedTokenStreamCharsBase<Unit>;
  using GeneralCharsBase = GeneralTokenStreamChars<Unit, AnyCharsAccess>;
  using SpecializedChars = TokenStreamChars<Unit, AnyCharsAccess>;

  using Position = TokenStreamPosition<Unit>;

  // Anything inherited through a base class whose type depends upon this
  // class's template parameters can only be accessed through a dependent
  // name: prefixed with |this|, by explicit qualification, and so on. (This
  // is so that references to inherited fields are statically distinguishable
  // from references to names outside of the class.) This is tedious and
  // onerous.
  //
  // As an alternative, we directly add every one of these functions to this
  // class, using explicit qualification to address the dependent-name
  // problem. |this| or other qualification is no longer necessary -- at
  // cost of this ever-changing laundry list of |using|s. So it goes.
 public:
  using GeneralCharsBase::anyCharsAccess;
  using GeneralCharsBase::computeLineAndColumn;
  using TokenStreamCharsShared::adoptState;

 private:
  using typename CharsBase::SourceUnits;

 private:
  using CharsBase::atomizeSourceChars;
  using GeneralCharsBase::badToken;
  // Deliberately don't |using| |charBuffer| because of bug 1472569.
  // :-(
  using CharsBase::consumeKnownCodeUnit;
  using CharsBase::matchCodeUnit;
  using CharsBase::matchLineTerminator;
  using CharsBase::peekCodeUnit;
  using GeneralCharsBase::computeColumn;
  using GeneralCharsBase::fillExceptingContext;
  using GeneralCharsBase::getCodeUnit;
  using GeneralCharsBase::getFullAsciiCodePoint;
  using GeneralCharsBase::internalComputeLineOfContext;
  using GeneralCharsBase::matchUnicodeEscapeIdent;
  using GeneralCharsBase::matchUnicodeEscapeIdStart;
  using GeneralCharsBase::newAtomToken;
  using GeneralCharsBase::newBigIntToken;
  using GeneralCharsBase::newNameToken;
  using GeneralCharsBase::newNumberToken;
  using GeneralCharsBase::newPrivateNameToken;
  using GeneralCharsBase::newRegExpToken;
  using GeneralCharsBase::newSimpleToken;
  using SpecializedChars::getNonAsciiCodePoint;
  using SpecializedChars::getNonAsciiCodePointDontNormalize;
  using TokenStreamCharsShared::copyCharBufferTo;
  using TokenStreamCharsShared::drainCharBufferIntoAtom;
  using TokenStreamCharsShared::isAsciiCodePoint;
  // Deliberately don't |using| |sourceUnits| because of bug 1472569. :-(
  using CharsBase::toUnit;
  using GeneralCharsBase::ungetCodeUnit;
  using GeneralCharsBase::updateLineInfoForEOL;

  template <typename CharU>
  friend class TokenStreamPosition;

 public:
  TokenStreamSpecific(FrontendContext* fc, ParserAtomsTable* parserAtoms,
                      const JS::ReadOnlyCompileOptions& options,
                      const Unit* units, size_t length);

  /**
   * Get the next code point, converting LineTerminatorSequences to '\n' and
   * updating internal line-counter state if needed. Return true on success.
   * Return false on failure.
   */
  [[nodiscard]] MOZ_ALWAYS_INLINE bool getCodePoint() {
    int32_t unit = getCodeUnit();
    if (MOZ_UNLIKELY(unit == EOF)) {
      MOZ_ASSERT(anyCharsAccess().flags.isEOF,
                 "flags.isEOF should have been set by getCodeUnit()");
      return true;
    }

    if (isAsciiCodePoint(unit)) {
      return getFullAsciiCodePoint(unit);
    }

    char32_t cp;
    return getNonAsciiCodePoint(unit, &cp);
  }

  // If there is an invalid escape in a template, report it and return false,
  // otherwise return true.
  bool checkForInvalidTemplateEscapeError() {
    if (anyCharsAccess().invalidTemplateEscapeType == InvalidEscapeType::None) {
      return true;
    }

    reportInvalidEscapeError(anyCharsAccess().invalidTemplateEscapeOffset,
                             anyCharsAccess().invalidTemplateEscapeType);
    return false;
  }

 public:
  // Implement ErrorReporter.

  std::optional<bool> isOnThisLine(size_t offset,
                                   uint32_t lineNum) const final {
    return anyCharsAccess().srcCoords.isOnThisLine(offset, lineNum);
  }

  uint32_t lineAt(size_t offset) const final {
    const auto& anyChars = anyCharsAccess();
    auto lineToken = anyChars.lineToken(offset);
    return anyChars.lineNumber(lineToken);
  }

  JS::LimitedColumnNumberOneOrigin columnAt(size_t offset) const final {
    return computeColumn(anyCharsAccess().lineToken(offset), offset);
  }

 private:
  // Implement ErrorReportMixin.

  FrontendContext* getContext() const override {
    return anyCharsAccess().context();
  }

  [[nodiscard]] bool strictMode() const override {
    return anyCharsAccess().strictMode();
  }

 public:
  // Implement ErrorReportMixin.

  const JS::ReadOnlyCompileOptions& options() const final {
    return anyCharsAccess().options();
  }

  [[nodiscard]] bool computeErrorMetadata(
      ErrorMetadata* err, const ErrorOffset& errorOffset) const override;

 private:
  void reportInvalidEscapeError(uint32_t offset, InvalidEscapeType type) {
    switch (type) {
      case InvalidEscapeType::None:
        MOZ_ASSERT_UNREACHABLE("unexpected InvalidEscapeType");
        return;
      case InvalidEscapeType::Hexadecimal:
        errorAt(offset, JSMSG_MALFORMED_ESCAPE, "hexadecimal");
        return;
      case InvalidEscapeType::Unicode:
        errorAt(offset, JSMSG_MALFORMED_ESCAPE, "Unicode");
        return;
      case InvalidEscapeType::UnicodeOverflow:
        errorAt(offset, JSMSG_UNICODE_OVERFLOW, "escape sequence");
        return;
      case InvalidEscapeType::Octal:
        errorAt(offset, JSMSG_DEPRECATED_OCTAL_ESCAPE);
        return;
      case InvalidEscapeType::EightOrNine:
        errorAt(offset, JSMSG_DEPRECATED_EIGHT_OR_NINE_ESCAPE);
        return;
    }
  }

  void reportIllegalCharacter(int32_t cp);

  [[nodiscard]] bool putIdentInCharBuffer(const Unit* identStart);

  using IsIntegerUnit = bool (*)(int32_t);
  [[nodiscard]] MOZ_ALWAYS_INLINE bool matchInteger(IsIntegerUnit isIntegerUnit,
                                                    int32_t* nextUnit);
  [[nodiscard]] MOZ_ALWAYS_INLINE bool matchIntegerAfterFirstDigit(
      IsIntegerUnit isIntegerUnit, int32_t* nextUnit);

  /**
   * Tokenize a decimal number that begins at |numStart| into the provided
   * token.
   *
   * |unit| must be one of these values:
   *
   *   1. The first decimal digit in the integral part of a decimal number
   *      not starting with '.', e.g. '1' for "17", '0' for "0.14", or
   *      '8' for "8.675309e6".
   *
   *   In this case, the next |getCodeUnit()| must return the code unit after
   *   |unit| in the overall number.
   *
   *   2. The '.' in a "."-prefixed decimal number, e.g. ".17" or ".1e3".
   *
   *    In this case, the next |getCodeUnit()| must return the code unit
   *    *after* the '.'.
   *
   * 3. (Non-strict mode code only) The first non-ASCII-digit unit for a
   *    "noctal" number that begins with a '0' but contains a non-octal digit
   *    in its integer part so is interpreted as decimal, e.g. '.' in "09.28"
   *    or EOF for "0386" or '+' in "09+7" (three separate tokens).
   *
   *    In this case, the next |getCodeUnit()| returns the code unit after
   *    |unit|: '2', 'EOF', or '7' in the examples above.
   *
   * This interface is super-hairy and horribly stateful. Unfortunately, its
   * hair merely reflects the intricacy of ECMAScript numeric literal syntax.
   * And incredibly, it *improves* on the goto-based horror that predated it.
   */
  [[nodiscard]] bool decimalNumber(int32_t unit, TokenStart start,
                                   const Unit* numStart, Modifier modifier,
                                   TokenKind* out);

  /** Tokenize a regular expression literal beginning at |start|. */
  [[nodiscard]] bool regexpLiteral(TokenStart start, TokenKind* out);

  /**
   * Slurp characters between |start| and sourceUnits.current() into
   * charBuffer, to later parse into a bigint.
   */
  [[nodiscard]] bool bigIntLiteral(TokenStart start, Modifier modifier,
                                   TokenKind* out);

 public:
  // Advance to the next token. If the token stream encountered an error,
  // return false. Otherwise return true and store the token kind in |*ttp|.
  [[nodiscard]] bool getToken(TokenKind* ttp, Modifier modifier = SlashIsDiv) {
    // Check for a pushed-back token resulting from mismatching lookahead.
    TokenStreamAnyChars& anyChars = anyCharsAccess();
    if (anyChars.lookahead != 0) {
      MOZ_ASSERT(!anyChars.flags.hadError);
      anyChars.lookahead--;
      anyChars.advanceCursor();
      TokenKind tt = anyChars.currentToken().type;
      MOZ_ASSERT(tt != TokenKind::Eol);
      verifyConsistentModifier(modifier, anyChars.currentToken());
      *ttp = tt;
      return true;
    }

    return getTokenInternal(ttp, modifier);
  }

  [[nodiscard]] bool peekToken(TokenKind* ttp, Modifier modifier = SlashIsDiv) {
    TokenStreamAnyChars& anyChars = anyCharsAccess();
    if (anyChars.lookahead > 0) {
      MOZ_ASSERT(!anyChars.flags.hadError);
      verifyConsistentModifier(modifier, anyChars.nextToken());
      *ttp = anyChars.nextToken().type;
      return true;
    }
    if (!getTokenInternal(ttp, modifier)) {
      return false;
    }
    anyChars.ungetToken();
    return true;
  }

  [[nodiscard]] bool peekTokenPos(TokenPos* posp,
                                  Modifier modifier = SlashIsDiv) {
    TokenStreamAnyChars& anyChars = anyCharsAccess();
    if (anyChars.lookahead == 0) {
      TokenKind tt;
      if (!getTokenInternal(&tt, modifier)) {
        return false;
      }
      anyChars.ungetToken();
      MOZ_ASSERT(anyChars.hasLookahead());
    } else {
      MOZ_ASSERT(!anyChars.flags.hadError);
      verifyConsistentModifier(modifier, anyChars.nextToken());
    }
    *posp = anyChars.nextToken().pos;
    return true;
  }

  [[nodiscard]] bool peekOffset(uint32_t* offset,
                                Modifier modifier = SlashIsDiv) {
    TokenPos pos;
    if (!peekTokenPos(&pos, modifier)) {
      return false;
    }
    *offset = pos.begin;
    return true;
  }

  // This is like peekToken(), with one exception: if there is an EOL
  // between the end of the current token and the start of the next token, it
  // returns true and stores Eol in |*ttp|.
  // In that case, no token with
  // Eol is actually created, just an Eol TokenKind is returned, and
  // currentToken() shouldn't be consulted. (This is the only place Eol
  // is produced.)
  [[nodiscard]] MOZ_ALWAYS_INLINE bool peekTokenSameLine(
      TokenKind* ttp, Modifier modifier = SlashIsDiv) {
    TokenStreamAnyChars& anyChars = anyCharsAccess();
    const Token& curr = anyChars.currentToken();

    // If lookahead != 0, we have scanned ahead at least one token, and
    // |lineno| is the line that the furthest-scanned token ends on. If
    // it's the same as the line that the current token ends on, that's a
    // stronger condition than what we are looking for, and we don't need
    // to return Eol.
    if (anyChars.lookahead != 0) {
      std::optional<bool> onThisLineStatus =
          anyChars.srcCoords.isOnThisLine(curr.pos.end, anyChars.lineno);
      if (!onThisLineStatus.has_value()) {
        error(JSMSG_OUT_OF_MEMORY);
        return false;
      }

      bool onThisLine = *onThisLineStatus;
      if (onThisLine) {
        MOZ_ASSERT(!anyChars.flags.hadError);
        verifyConsistentModifier(modifier, anyChars.nextToken());
        *ttp = anyChars.nextToken().type;
        return true;
      }
    }

    // The above check misses two cases where we don't have to return
    // Eol.
    // - The next token starts on the same line, but is a multi-line token.
    // - The next token starts on the same line, but lookahead==2 and there
    //   is a newline between the next token and the one after that.
    // The following test is somewhat expensive but gets these cases (and
    // all others) right.
    TokenKind tmp;
    if (!getToken(&tmp, modifier)) {
      return false;
    }

    const Token& next = anyChars.currentToken();
    anyChars.ungetToken();

    // Careful, |next| points to an initialized-but-not-allocated Token!
    // This is safe because we don't modify token data below.

    auto currentEndToken = anyChars.lineToken(curr.pos.end);
    auto nextBeginToken = anyChars.lineToken(next.pos.begin);

    *ttp =
        currentEndToken.isSameLine(nextBeginToken) ? next.type : TokenKind::Eol;
    return true;
  }

  // Get the next token from the stream if its kind is |tt|.
  [[nodiscard]] bool matchToken(bool* matchedp, TokenKind tt,
                                Modifier modifier = SlashIsDiv) {
    TokenKind token;
    if (!getToken(&token, modifier)) {
      return false;
    }
    if (token == tt) {
      *matchedp = true;
    } else {
      anyCharsAccess().ungetToken();
      *matchedp = false;
    }
    return true;
  }

  void consumeKnownToken(TokenKind tt, Modifier modifier = SlashIsDiv) {
    bool matched;
    MOZ_ASSERT(anyCharsAccess().hasLookahead());
    MOZ_ALWAYS_TRUE(matchToken(&matched, tt, modifier));
    MOZ_ALWAYS_TRUE(matched);
  }

  [[nodiscard]] bool nextTokenEndsExpr(bool* endsExpr) {
    TokenKind tt;
    if (!peekToken(&tt)) {
      return false;
    }

    *endsExpr = anyCharsAccess().isExprEnding[size_t(tt)];
    if (*endsExpr) {
      // If the next token ends an overall Expression, we'll parse this
      // Expression without ever invoking Parser::orExpr(). But we need that
      // function's DEBUG-only side effect of marking this token as safe to get
      // with SlashIsRegExp, so we have to do it manually here.
      anyCharsAccess().allowGettingNextTokenWithSlashIsRegExp();
    }
    return true;
  }

  [[nodiscard]] bool advance(size_t position);

  void seekTo(const Position& pos);
  [[nodiscard]] bool seekTo(const Position& pos,
                            const TokenStreamAnyChars& other);

  void rewind(const Position& pos) {
    MOZ_ASSERT(pos.buf <= this->sourceUnits.addressOfNextCodeUnit(),
               "should be rewinding here");
    seekTo(pos);
  }

  [[nodiscard]] bool rewind(const Position& pos,
                            const TokenStreamAnyChars& other) {
    MOZ_ASSERT(pos.buf <= this->sourceUnits.addressOfNextCodeUnit(),
               "should be rewinding here");
    return seekTo(pos, other);
  }

  void fastForward(const Position& pos) {
    MOZ_ASSERT(this->sourceUnits.addressOfNextCodeUnit() <= pos.buf,
               "should be moving forward here");
    seekTo(pos);
  }

  [[nodiscard]] bool fastForward(const Position& pos,
                                 const TokenStreamAnyChars& other) {
    MOZ_ASSERT(this->sourceUnits.addressOfNextCodeUnit() <= pos.buf,
               "should be moving forward here");
    return seekTo(pos, other);
  }

  const Unit* codeUnitPtrAt(size_t offset) const {
    return this->sourceUnits.codeUnitPtrAt(offset);
  }

  [[nodiscard]] bool identifierName(TokenStart start, const Unit* identStart,
                                    IdentifierEscapes escaping,
                                    Modifier modifier,
                                    NameVisibility visibility, TokenKind* out);

  [[nodiscard]] bool matchIdentifierStart(IdentifierEscapes* sawEscape);

  [[nodiscard]] bool getTokenInternal(TokenKind* const ttp,
                                      const Modifier modifier);

  [[nodiscard]] bool getStringOrTemplateToken(char untilChar, Modifier modifier,
                                              TokenKind* out);

  // Parse a TemplateMiddle or TemplateTail token (one of the string-like parts
  // of a template string) after already consuming the leading `RightCurly`.
  // (The spec says the `}` is the first character of the TemplateMiddle/
  // TemplateTail, but we treat it as a separate token because that's much
  // easier to implement in both TokenStream and the parser.)
  //
  // This consumes a token and sets the current token, like `getToken()`. It
  // doesn't take a Modifier because there's no risk of encountering a division
  // operator or RegExp literal.
  //
  // On success, `*ttp` is either `TokenKind::TemplateHead` (if we got a
  // TemplateMiddle token) or `TokenKind::NoSubsTemplate` (if we got a
  // TemplateTail). That may seem strange; there are four different template
  // token types in the spec, but we only use two. We use `TemplateHead` for
  // TemplateMiddle because both end with `...${`, and `NoSubsTemplate` for
  // TemplateTail because both contain the end of the template, including the
  // closing quote mark. They're not treated differently, either in the parser
  // or in the tokenizer.
  [[nodiscard]] bool getTemplateToken(TokenKind* ttp) {
    MOZ_ASSERT(anyCharsAccess().currentToken().type == TokenKind::RightCurly);
    return getStringOrTemplateToken('`', SlashIsInvalid, ttp);
  }

  [[nodiscard]] bool getDirectives(bool isMultiline, bool shouldWarnDeprecated);
  [[nodiscard]] bool getDirective(
      bool isMultiline, bool shouldWarnDeprecated, const char* directive,
      uint8_t directiveLength, const char* errorMsgPragma,
      UniquePtr<char16_t[], JS::FreePolicy>* destination);
  [[nodiscard]] bool getDisplayURL(bool isMultiline, bool shouldWarnDeprecated);
  [[nodiscard]] bool getSourceMappingURL(bool isMultiline,
                                         bool shouldWarnDeprecated);
};

// It's preferable to define this in TokenStream.cpp, but its template-ness
// means we'd then have to *instantiate* this constructor for all possible
// (Unit, AnyCharsAccess) pairs -- and that gets super-messy as AnyCharsAccess
// *itself* is templated. This symbol really isn't that huge compared to some
// defined inline in TokenStreamSpecific, so just rely on the linker commoning
// stuff up.
template <typename Unit>
template <class AnyCharsAccess>
inline TokenStreamPosition<Unit>::TokenStreamPosition(
    TokenStreamSpecific<Unit, AnyCharsAccess>& tokenStream)
    : currentToken(tokenStream.anyCharsAccess().currentToken()) {
  TokenStreamAnyChars& anyChars = tokenStream.anyCharsAccess();

  buf =
      tokenStream.sourceUnits.addressOfNextCodeUnit(/* allowPoisoned = */ true);
  flags = anyChars.flags;
  lineno = anyChars.lineno;
  linebase = anyChars.linebase;
  prevLinebase = anyChars.prevLinebase;
  lookahead = anyChars.lookahead;
  currentToken = anyChars.currentToken();
  for (unsigned i = 0; i < anyChars.lookahead; i++) {
    lookaheadTokens[i] = anyChars.tokens[anyChars.aheadCursor(1 + i)];
  }
}

class TokenStreamAnyCharsAccess {
 public:
  template <class TokenStreamSpecific>
  static inline TokenStreamAnyChars& anyChars(TokenStreamSpecific* tss);

  template <class TokenStreamSpecific>
  static inline const TokenStreamAnyChars& anyChars(
      const TokenStreamSpecific* tss);
};

class MOZ_STACK_CLASS TokenStream
    : public TokenStreamAnyChars,
      public TokenStreamSpecific<char16_t, TokenStreamAnyCharsAccess> {
  using Unit = char16_t;

 public:
  TokenStream(FrontendContext* fc, ParserAtomsTable* parserAtoms,
              const JS::ReadOnlyCompileOptions& options, const Unit* units,
              size_t length, StrictModeGetter* smg)
      : TokenStreamAnyChars(fc, options, smg),
        TokenStreamSpecific<Unit, TokenStreamAnyCharsAccess>(
            fc, parserAtoms, options, units, length) {}
};

class MOZ_STACK_CLASS DummyTokenStream final : public TokenStream {
 public:
  DummyTokenStream(FrontendContext* fc,
                   const JS::ReadOnlyCompileOptions& options)
      : TokenStream(fc, nullptr, options, nullptr, 0, nullptr) {}
};

template <class TokenStreamSpecific>
/* static */ inline TokenStreamAnyChars&
TokenStreamAnyCharsAccess::anyChars(
    TokenStreamSpecific* tss) {
  auto* ts = static_cast<TokenStream*>(tss);
  return *static_cast<TokenStreamAnyChars*>(ts);
}

template <class TokenStreamSpecific>
/* static */ inline const TokenStreamAnyChars&
TokenStreamAnyCharsAccess::anyChars(const TokenStreamSpecific* tss) {
  const auto* ts = static_cast<const TokenStream*>(tss);
  return *static_cast<const TokenStreamAnyChars*>(ts);
}

extern const char* TokenKindToDesc(TokenKind tt);

}  // namespace frontend
}  // namespace js

#ifdef DEBUG
extern const char* TokenKindToString(js::frontend::TokenKind tt);
#endif

#endif /* frontend_TokenStream_h */