tor-browser

The Tor Browser
git clone https://git.dasho.dev/tor-browser.git

PortableBaselineInterpret.h (16973B)


      1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
      2 * vim: set ts=8 sts=2 et sw=2 tw=80:
      3 * This Source Code Form is subject to the terms of the Mozilla Public
      4 * License, v. 2.0. If a copy of the MPL was not distributed with this
      5 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
      6 
      7 #ifndef vm_PortableBaselineInterpret_h
      8 #define vm_PortableBaselineInterpret_h
      9 
     10 /*
     11 * [SMDOC] Portable Baseline Interpreter
     12 * =====================================
     13 *
     14 * The Portable Baseline Interpreter (PBL) is a portable interpreter
     15 * that supports executing ICs by directly interpreting CacheIR.
     16 *
     17 * This interpreter tier fits into the hierarchy between the C++
     18 * interpreter, which is fully generic and does not specialize with
     19 * ICs, and the native baseline interpreter, which does attach and
     20 * execute ICs but requires native codegen (JIT). The distinguishing
     21 * feature of PBL is that it *does not require codegen*: it can run on
     22 * any platform for which SpiderMonkey supports an interpreter-only
     23 * build. This is useful both for platforms that do not support
     24 * runtime addition of new code (e.g., running within a WebAssembly
     25 * module with a `wasm32-wasi` build) and for platforms that may
     26 * disallow it for security reasons.
     27 *
     28 * The main idea of PBL is to emulate, as much as possible, how the
     29 * native baseline interpreter works, so that the rest of the engine
     30 * can work the same either way. The main aspect of this "emulation"
     31 * comes with stack frames: unlike the native blinterp and JIT tiers,
     32 * we cannot use the machine stack, because we are still executing in
     33 * portable C++ code and the platform's C++ compiler controls the
     34 * machine stack's layout. Instead, we use an auxiliary stack.
     35 *
     36 * Auxiliary Stack
     37 * ---------------
     38 *
     39 * PBL creates baseline stack frames (see `BaselineFrame` and related
     40 * structs) on an *auxiliary stack*, contiguous memory allocated and
     41 * owned by the JSRuntime.
     42 *
     43 * This stack operates nearly identically to the machine stack: it
     44 * grows downward, we push stack frames, we maintain a linked list of
     45 * frame pointers, and a series of contiguous frames form a
     46 * `JitActivation`, with the most recent activation reachable from the
     47 * `JSContext`. The only actual difference is that the return address
     48 * slots in the frame layouts are always null pointers, because there
     49 * is no need to save return addresses: we always know where we are
     50 * going to return to (either post-IC code -- the return point of
     51 * which is known because we actually do a C++-level call from the
     52 * JSOp interpreter to the IC interpreter -- or to dispatch the next
     53 * JSOp).
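The frame layout described above can be sketched as follows. `AuxFrameSketch` and `PushFrame` are hypothetical stand-ins for illustration only, not the real `BaselineFrame` layout:

```cpp
#include <cassert>
#include <cstdint>

// Minimal sketch (hypothetical layout): an auxiliary-stack frame grows
// downward, links to its caller's frame, and leaves the return-address
// slot null because PBL always knows where control resumes.
struct AuxFrameSketch {
  AuxFrameSketch* callerFP;  // linked list of frame pointers
  void* returnAddress;       // always nullptr in PBL frames
};

inline AuxFrameSketch* PushFrame(uint8_t*& sp, AuxFrameSketch* callerFP) {
  sp -= sizeof(AuxFrameSketch);  // downward growth, like the machine stack
  auto* frame = reinterpret_cast<AuxFrameSketch*>(sp);
  frame->callerFP = callerFP;
  frame->returnAddress = nullptr;  // no real return address to save
  return frame;
}
```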
     54 *
     55 * The same invariants as for native baseline code apply here: when we
     56 * are in `PortableBaselineInterpret` (the PBL interpreter body) or
     57 * `ICInterpretOps` (the IC interpreter) or related helpers, it is as
     58 * if we are in JIT code, and local state determines the top-of-stack
     59 * and innermost frame. The activation is not "finished" and cannot be
     60 * traversed. When we need to call into the rest of SpiderMonkey, we
     61 * emulate how that would work in JIT code, via an exit frame (that
     62 * would ordinarily be pushed by a trampoline) and saving that frame
     63 * as the exit-frame pointer in the VM state.
     64 *
     65 * To add a little compile-time enforcement of this strategy, and
     66 * ensure that we don't accidentally call something that will want to
     67 * traverse the (in-progress and not-completed) JIT activation, we use
     68 * a helper class `VMFrame` that pushes and pops the exit frame,
     69 * wrapping the callsite into the rest of SM with an RAII idiom. Then,
     70 * we *hide the `JSContext`*, and rely on the idiom that `cx` is
     71 * passed to anything that can GC or otherwise observe the JIT
     72 * state. The `JSContext` is passed in as `cx_`, and we name the
     73 * `VMFrame` local `cx` in the macro that invokes it; this `cx` then
     74 * has an implicit conversion to a `JSContext*` value and reveals the
     75 * real context.
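The RAII idiom described above can be sketched as below; `JSContext`, `VMFrame`, and `exitFP` here are toy stand-ins, not the real SpiderMonkey types:

```cpp
#include <cassert>

// Toy stand-in for the engine's context (not the real JSContext).
struct JSContext {
  void* exitFP = nullptr;  // exit-frame pointer in VM state
};

// RAII helper: publishes an exit frame on construction, clears it on
// destruction, and is the only way to reveal the real JSContext*.
class VMFrame {
  JSContext* cx_;

 public:
  VMFrame(JSContext* cx, void* exitFrame) : cx_(cx) {
    cx_->exitFP = exitFrame;  // activation is now traversable
  }
  ~VMFrame() { cx_->exitFP = nullptr; }  // back to "in JIT code" state

  // Implicit conversion reveals the context only inside the RAII scope.
  operator JSContext*() const { return cx_; }
};

// Returns true if the exit frame was published inside the scope and
// cleared after it.
inline bool DemoVMFrame() {
  JSContext ctx;
  bool ok;
  {
    int frame = 0;
    VMFrame cx(&ctx, &frame);  // local named `cx`, as in the macro idiom
    ok = (ctx.exitFP == &frame);
    JSContext* real = cx;  // implicit conversion reveals the context
    ok = ok && (real == &ctx);
  }
  return ok && ctx.exitFP == nullptr;
}
```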
     76 *
     77 * Interpreter Loops
     78 * -----------------
     79 *
     80 * There are two interpreter loops: the JSOp interpreter and the
     81 * CacheIR interpreter. These closely correspond to (i) the blinterp
     82 * body that is generated at startup for the native baseline
     83 * interpreter, and (ii) an interpreter version of the code generated
     84 * by the `BaselineCacheIRCompiler`, respectively.
     85 *
     86 * Execution begins in the JSOp interpreter, and for any op(*) that
     87 * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The
     88 * IC interpreter runs a loop that traverses the IC stub chain, either
     89 * reaching CacheIR bytecode and executing it in a virtual machine, or
     90 * reaching the fallback stub and executing that (likely pushing an
     91 * exit frame and calling into the rest of SpiderMonkey).
     92 *
     93 * (*) As an optimization, some opcodes that would have IC sites in
     94 * native baseline skip their IC chains and run generic code instead
     95 * in PBL. See "Hybrid IC mode" below for more details.
     96 *
     97 * IC Interpreter State
     98 * --------------------
     99 *
    100 * While the JS opcode interpreter's abstract machine model and its
    101 * mapping of those abstract semantics to real machine state are
    102 * well-defined (by the other baseline tiers), the IC interpreter's
    103 * mapping is less so. When executing in native baseline tiers,
    104 * CacheIR is compiled to machine code that undergoes register
    105 * allocation and several optimizations (e.g., handling constants
    106 * specially, and eliding type-checks on values when we know their
    107 * actual types). No other interpreter for CacheIR exists, so we get
    108 * to define how we map the semantics to interpreter state.
    109 *
    110 * We choose to keep an array of uint64_t values as "virtual
    111 * registers", each corresponding to a particular OperandId, and we
    112 * store the same values that would exist in the native machine
    113 * registers. In other words, we do not do any sort of register
    114 * allocation or reclamation of storage slots, because we don't have
    115 * any lookahead in the interpreter. We rely on the typesafe writer
    116 * API, with newtype'd wrappers for different kinds of values
    117 * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing
    118 * typesafe CacheIR bytecode, in order to properly store and interpret
    119 * unboxed values in the virtual registers.
    120 *
    121 * There are several subtle details usually handled by register
    122 * allocation in the CacheIR compilers that need to be handled here
    123 * too, mainly around input arguments and restoring state when
    124 * chaining to the next IC stub. IC callsites place inputs into the
    125 * first N OperandId registers directly, corresponding to what the
    126 * CacheIR expects. There are some CacheIR opcodes that mutate their
    127 * argument in-place (e.g., guarding that a Value is an Object strips
    128 * the tag-bits from the Value and turns it into a raw pointer), so we
    129 * cannot rely on these remaining unmodified if we need to invoke the
    130 * next IC in the chain; instead, we save and restore the first N
    131 * values in the chain-walking loop (according to the arity of the IC
    132 * kind).
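The save/restore discipline for IC inputs can be sketched as follows; the names and sizes here are hypothetical, not the real `ICRegs` layout:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: CacheIR OperandIds index an array of uint64_t "virtual
// registers"; IC inputs occupy the first N registers. Because some
// CacheIR ops mutate inputs in place, the chain-walking loop saves and
// restores the first `arity` registers before trying the next stub.
struct ICRegsSketch {
  static constexpr size_t kMaxInputs = 4;
  uint64_t regs[64] = {};
};

inline void SaveInputs(const ICRegsSketch& icregs,
                       uint64_t (&saved)[ICRegsSketch::kMaxInputs],
                       size_t arity) {
  for (size_t i = 0; i < arity; i++) saved[i] = icregs.regs[i];
}

inline void RestoreInputs(ICRegsSketch& icregs,
                          const uint64_t (&saved)[ICRegsSketch::kMaxInputs],
                          size_t arity) {
  for (size_t i = 0; i < arity; i++) icregs.regs[i] = saved[i];
}
```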
    133 *
    134 * Optimizations
    135 * -------------
    136 *
    137 * There are several implementation details that are critical for
    138 * performance, and thus should be carefully maintained or verified
    139 * with any changes:
    140 *
    141 * - Caching values in locals: in order to be competitive with "native
    142 *   baseline interpreter", which has the advantage of using machine
    143 *   registers for commonly-accessed values such as the
    144 *   top-of-operand-stack and the JS opcode PC, we are careful to
    145 *   ensure that the C++ compiler can keep these values in registers
    146 *   in PBL as well. One might naively store `pc`, `sp`, `fp`, and the
    147 *   like in a context struct (of "virtual CPU registers") that is
    148 *   passed to e.g. the IC interpreter. This would be a mistake: if
    149 *   the values exist in memory, the compiler cannot "lift" them to
    150 *   locals that can live in registers, and so every push and pop (for
    151 *   example) performs a store. This overhead is significant,
    152 *   especially when executing more "lightweight" opcodes.
    153 *
    154 *   We make use of an important property -- the balanced-stack
    155 *   invariant -- so that we can pass SP *into* calls but not take an
    156 *   updated SP *from* them. When invoking an IC, we expect that when
    157 *   it returns, SP will be at the same location (one could think of
    158 *   SP as a "callee-saved register", though it's not usually
    159 *   described that way). Thus, we can avoid a dependency on a value
    160 *   that would have to be passed back through memory.
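The balanced-stack invariant can be sketched as a callee that takes `sp` by value, so the caller's copy never round-trips through memory; this is a toy illustration, not the real IC interpreter signature:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the IC body receives `sp` by value, may push and pop on the
// auxiliary stack internally, but must leave it at the same height, so
// the caller's `sp` local (which can live in a register) never needs to
// be reloaded from memory after the call.
inline uint64_t InterpretICSketch(uint64_t* sp) {
  *--sp = 42;          // push a temporary
  uint64_t v = *sp++;  // ...and pop it before returning (balanced)
  return v;
}
```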
    161 *
    162 * - Hybrid IC mode: the fact that we *interpret* ICs now means that
    163 *   they are more expensive to invoke. Whereas a small IC that guards
    164 *   two int32 arguments, performs an int32 add, and returns might
    165 *   have been a handful of instructions before, and the call/ret pair
    166 *   would have been very fast (and easy to predict) instructions at
    167 *   the machine level, the setup and context transition and the
    168 *   CacheIR opcode dispatch overhead would likely be much slower than
    169 *   a generic "if both int32, add" fastpath in the interpreter case
    170 *   for `JSOp::Add`.
    171 *
    172 *   We thus take a hybrid approach, and include these static
    173 *   fastpaths for what would have been ICs in "native
    174 *   baseline". These are enabled by the `kHybridICs` global and may
    175 *   be removed in the future (transitioning back to ICs) if/when we
    176 *   can reduce the cost of interpreted ICs further.
    177 *
    178 *   Right now, calls and property accesses use ICs:
    179 *
    180 *   - Calls can often be special-cased with CacheIR when intrinsics
    181 *     are invoked. For example, a call to `String.length` can turn
    182 *     into a CacheIR opcode that directly reads a `JSString`'s length
    183 *     field.
    184 *   - Property accesses are so frequent, and the shape-lookup path
    185 *     is slow enough, that it still makes sense to guard on shape
    186 *     and quickly return a particular slot.
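The hybrid fastpath idea can be sketched for an add operation; the value encoding and function names below are toy stand-ins, not SpiderMonkey's real `Value` or IC machinery:

```cpp
#include <cassert>
#include <cstdint>

// Toy value encoding for illustration only.
struct ToyValue {
  bool isInt32;
  int32_t i32;
};

// Stand-in for walking the IC chain / calling into the runtime.
inline ToyValue AddViaICSketch(ToyValue, ToyValue) { return {false, 0}; }

// Sketch of the hybrid approach for an add: handle the int32+int32 case
// inline and fall back to the (interpreted) IC only for everything else.
inline ToyValue HybridAddSketch(ToyValue a, ToyValue b) {
  if (a.isInt32 && b.isInt32) {
    int64_t r = int64_t(a.i32) + int64_t(b.i32);
    if (r == int32_t(r)) {
      return {true, int32_t(r)};  // static fastpath: no IC dispatch at all
    }
  }
  return AddViaICSketch(a, b);  // overflow or non-int32: take the IC path
}
```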
    187 *
    188 * - Static branch prediction for opcode dispatch: we adopt an
    189 *   interpreter optimization we call "static branch prediction": when
    190 *   one opcode is often followed by another, it is often more
    191 *   efficient to check for those specific cases first and branch
    192 *   directly to the case for the following opcode, doing the full
    193 *   switch otherwise. This is especially true when the indirect
    194 *   branches used by `switch` statements or computed gotos are
    195 *   expensive on a given platform, such as Wasm.
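Static branch prediction can be sketched with a toy stack machine (hypothetical opcodes, unrelated to real JSOps): after one opcode, a likely successor is tested with a direct conditional before falling back to the switch's indirect dispatch.

```cpp
#include <cassert>
#include <cstdint>

enum class Op : uint8_t { Push1, Add, Halt };

// Toy interpreter: after Push1, Add is assumed to be a common successor,
// so we check for it explicitly and handle it inline, avoiding the
// switch's indirect branch when the static prediction is right.
inline int RunSketch(const Op* pc) {
  int stack[16];
  int* sp = stack;
  for (;;) {
    switch (*pc++) {
      case Op::Push1:
        *sp++ = 1;
        if (*pc == Op::Add) {  // predicted successor: skip the switch
          pc++;
          int b = *--sp;
          sp[-1] += b;
        }
        break;
      case Op::Add: {  // generic dispatch path for the same opcode
        int b = *--sp;
        sp[-1] += b;
        break;
      }
      case Op::Halt:
        return sp[-1];
    }
  }
}
```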
    196 *
    197 * - Inlining: on some platforms, calls are expensive, and we want to
    198 *   avoid them whenever possible. We have found that it is quite
    199 *   important for performance to inline the IC interpreter into the
    200 *   JSOp interpreter at IC sites: both functions are quite large,
    201 *   with significant local state, and so otherwise, each IC call
    202 *   involves a lot of "context switching" as the code generated by
    203 *   the C++ compiler saves registers and constructs a new native
    204 *   frame. This is certainly a code-size tradeoff, but we have
    205 *   optimized for speed here.
    206 *
    207 * - Amortized stack checks: a naive interpreter implementation would
    208 *   check for auxiliary stack overflow on every push. We instead do
    209 *   this once when we enter a new JS function frame, using the
    210 *   script's precomputed "maximum stack depth" value. We keep a small
    211 *   stack margin always available, so that we have enough space to
    212 *   push an exit frame and invoke the "over-recursed" helper (which
    213 *   throws an exception) when we would otherwise overflow. The stack
    214 *   checks take this margin into account, failing if there would be
    215 *   less than the margin available at any point in the called
    216 *   function.
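The amortized check can be sketched as below; the margin size and struct layout are assumptions for illustration, not the real constants:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: validate once per frame entry against the script's precomputed
// maximum stack depth, keeping a margin free so that the over-recursion
// path itself still has room to push an exit frame and call out.
struct AuxStackSketch {
  static constexpr size_t kMarginSlots = 64;  // assumed margin size
  uint64_t* base;  // lowest address; the stack grows downward toward it
  uint64_t* top;   // current top-of-stack

  // If this passes, the individual pushes in the frame need no checks.
  bool CheckOnEntry(size_t maxStackDepth) const {
    return size_t(top - base) >= maxStackDepth + kMarginSlots;
  }
};
```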
    217 *
    218 * - Fastpaths for calls and returns: we are able to push and pop JS
    219 *   stack frames while remaining in one native (C++ interpreter
    220 *   function) frame, just as the C++ interpreter does. This means
    221 *   that there is a one-to-many mapping from native stack frame to JS
    222 *   stack frame. This does create some complications at points that
    223 *   pop frames: we might remain in the same C++ frame, or we might
    224 *   return at the C++ level. We handle this in a unified way for
    225 *   returns and exception unwinding as described below.
    226 *
    227 * Unwinding
    228 * ---------
    229 *
    230 * Because one C++ interpreter frame can correspond to multiple JS
    231 * frames, we need to disambiguate the two cases whenever leaving a
    232 * frame: we may need to return, or we may stay in the current
    233 * function and dispatch the next opcode at the caller's next PC.
    234 *
    235 * Exception unwinding complicates this further. PBL uses the same
    236 * exception-handling code that native baseline does, and this code
    237 * computes a `ResumeFromException` struct that tells us what our new
    238 * stack pointer and frame pointer must be. These values could be
    239 * arbitrarily far "up" the stack in the current activation. It thus
    240 * wouldn't be sufficient to count how many JS frames we have, and
    241 * return at the C++ level when this reaches zero: we need to "unwind"
    242 * the C++ frames until we reach the appropriate JS frame.
    243 *
    244 * To solve both issues, we remember the "entry frame" when we enter a
    245 * new invocation of `PortableBaselineInterpret()`, and when returning
    246 * or unwinding, if the new frame is *above* this entry frame, we
    247 * return. We have an enum `PBIResult` that can encode, when
    248 * unwinding, *which* kind of unwinding we are doing, because when we
    249 * do eventually reach the C++ frame that owns the newly active JS
    250 * frame, we may resume into a different action depending on this
    251 * information.
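The return-or-stay decision can be sketched as a single pointer comparison; the helper name here is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the auxiliary stack grows downward, so a target frame at a
// numerically greater address than this invocation's entry frame belongs
// to an earlier C++ frame; the current interpreter invocation must
// return at the C++ level and let its caller continue the unwind.
inline bool MustReturnAtCppLevel(const uint8_t* entryFrame,
                                 const uint8_t* targetFrame) {
  return targetFrame > entryFrame;  // "above" the entry frame
}
```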
    252 *
    253 * Completeness
    254 * ------------
    255 *
    256 * Whenever a new JSOp is added, the opcode needs to be added to
    257 * PBL. The compiler should enforce this: if no case is implemented
    258 * for an opcode, then the label in the computed-goto table will be
    259 * missing and PBL will not compile.
    260 *
    261 * In contrast, CacheIR opcodes need not be implemented right away,
    262 * and in fact right now most of the less-common ones are not
    263 * implemented by PBL. If the IC interpreter hits an unimplemented
    264 * opcode, it acts as if a guard had failed, and transfers to the next
    265 * stub in the chain. Every chain ends with a fallback stub that can
    266 * handle every case (it does not execute CacheIR at all, but instead
    267 * calls into the runtime), so this will always give the correct
    268 * result, albeit more slowly. Implementing the remainder of the
    269 * CacheIR opcodes, and new ones as they are added, is thus purely a
    270 * performance concern.
    271 *
    272 * PBL currently does not implement async resume into a suspended
    273 * generator. There is no particular reason that this cannot be
    274 * implemented; it just has not been done yet. Such an action will
    275 * currently call back into the C++ interpreter to run the resumed
    276 * generator body. Execution up to the first yield-point can still
    277 * occur in PBL, and PBL can successfully save the suspended state.
    278 */
    279 
    280 #include "jspubtd.h"
    281 
    282 #include "jit/BaselineFrame.h"
    283 #include "jit/BaselineIC.h"
    284 #include "jit/JitContext.h"
    285 #include "jit/JitRuntime.h"
    286 #include "jit/JitScript.h"
    287 #include "vm/Interpreter.h"
    288 #include "vm/Stack.h"
    289 
    290 namespace js {
    291 namespace pbl {
    292 
    293 // Trampoline invoked by EnterJit that sets up PBL state and invokes
    294 // the main interpreter loop.
    295 bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv,
    296                                size_t numFormals, jit::CalleeToken calleeToken,
    297                                JSObject* envChain, Value* result);
    298 
    299 // Predicate: are all conditions satisfied to allow execution within
    300 // PBL? This depends only on properties of the function to be invoked,
    301 // and not on other runtime state, like the current stack depth, so if
    302 // it returns `true` once, it can be assumed to always return `true`
    303 // for that function. See `PortableBaselineInterpreterStackCheck`
    304 * below for a complementary check that does not have this property.
    305 jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx,
    306                                                      RunState& state);
    307 
    308 // A check for available stack space on the PBL auxiliary stack that
    309 // is invoked before the main trampoline. This is required for entry
    310 // into PBL and should be checked before invoking the trampoline
    311 // above. Unlike `CanEnterPortableBaselineInterpreter`, the result of
    312 // this check cannot be cached: it must be checked on each potential
    313 // entry.
    314 bool PortableBaselineInterpreterStackCheck(JSContext* cx, RunState& state,
    315                                           size_t numActualArgs);
    316 
    317 struct State;
    318 struct Stack;
    319 struct StackVal;
    320 struct StackValNative;
    321 struct ICRegs;
    322 class VMFrameManager;
    323 
    324 enum class PBIResult {
    325  Ok,
    326  Error,
    327  Unwind,
    328  UnwindError,
    329  UnwindRet,
    330 };
    331 
    332 template <bool IsRestart, bool HybridICs>
    333 PBIResult PortableBaselineInterpret(
    334    JSContext* cx_, State& state, Stack& stack, StackVal* sp,
    335    JSObject* envChain, Value* ret, jsbytecode* pc, ImmutableScriptData* isd,
    336    jsbytecode* restartEntryPC, jit::BaselineFrame* restartFrame,
    337    StackVal* restartEntryFrame, PBIResult restartCode);
    338 
    339 uint8_t* GetPortableFallbackStub(jit::BaselineICFallbackKind kind);
    340 uint8_t* GetICInterpreter();
    341 
    342 } /* namespace pbl */
    343 } /* namespace js */
    344 
    345 #endif /* vm_PortableBaselineInterpret_h */