PortableBaselineInterpret.h (16973B)
1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- 2 * vim: set ts=8 sts=2 et sw=2 tw=80: 3 * This Source Code Form is subject to the terms of the Mozilla Public 4 * License, v. 2.0. If a copy of the MPL was not distributed with this 5 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ 6 7 #ifndef vm_PortableBaselineInterpret_h 8 #define vm_PortableBaselineInterpret_h 9 10 /* 11 * [SMDOC] Portable Baseline Interpreter 12 * ===================================== 13 * 14 * The Portable Baseline Interpreter (PBL) is a portable interpreter 15 * that supports executing ICs by directly interpreting CacheIR. 16 * 17 * This interpreter tier fits into the hierarchy between the C++ 18 * interpreter, which is fully generic and does not specialize with 19 * ICs, and the native baseline interpreter, which does attach and 20 * execute ICs but requires native codegen (JIT). The distinguishing 21 * feature of PBL is that it *does not require codegen*: it can run on 22 * any platform for which SpiderMonkey supports an interpreter-only 23 * build. This is useful both for platforms that do not support 24 * runtime addition of new code (e.g., running within a WebAssembly 25 * module with a `wasm32-wasi` build) or may disallow it for security 26 * reasons. 27 * 28 * The main idea of PBL is to emulate, as much as possible, how the 29 * native baseline interpreter works, so that the rest of the engine 30 * can work the same either way. The main aspect of this "emulation" 31 * comes with stack frames: unlike the native blinterp and JIT tiers, 32 * we cannot use the machine stack, because we are still executing in 33 * portable C++ code and the platform's C++ compiler controls the 34 * machine stack's layout. Instead, we use an auxiliary stack. 35 * 36 * Auxiliary Stack 37 * --------------- 38 * 39 * PBL creates baseline stack frames (see `BaselineFrame` and related 40 * structs) on an *auxiliary stack*, contiguous memory allocated and 41 * owned by the JSRuntime. 42 * 43 * This stack operates nearly identically to the machine stack: it 44 * grows downward, we push stack frames, we maintain a linked list of 45 * frame pointers, and a series of contiguous frames form a 46 * `JitActivation`, with the most recent activation reachable from the 47 * `JSContext`. The only actual difference is that the return address 48 * slots in the frame layouts are always null pointers, because there 49 * is no need to save return addresses: we always know where we are 50 * going to return to (either post-IC code -- the return point of 51 * which is known because we actually do a C++-level call from the 52 * JSOp interpreter to the IC interpreter -- or to dispatch the next 53 * JSOp). 54 * 55 * The same invariants as for native baseline code apply here: when we 56 * are in `PortableBaselineInterpret` (the PBL interpreter body) or 57 * `ICInterpretOps` (the IC interpreter) or related helpers, it is as 58 * if we are in JIT code, and local state determines the top-of-stack 59 * and innermost frame. The activation is not "finished" and cannot be 60 * traversed. When we need to call into the rest of SpiderMonkey, we 61 * emulate how that would work in JIT code, via an exit frame (that 62 * would ordinarily be pushed by a trampoline) and saving that frame 63 * as the exit-frame pointer in the VM state. 64 * 65 * To add a little compile-time enforcement of this strategy, and 66 * ensure that we don't accidentally call something that will want to 67 * traverse the (in-progress and not-completed) JIT activation, we use 68 * a helper class `VMFrame` that pushes and pops the exit frame, 69 * wrapping the callsite into the rest of SM with an RAII idiom. Then, 70 * we *hide the `JSContext`*, and rely on the idiom that `cx` is 71 * passed to anything that can GC or otherwise observe the JIT 72 * state. The `JSContext` is passed in as `cx_`, and we name the 73 * `VMFrame` local `cx` in the macro that invokes it; this `cx` then 74 * has an implicit conversion to a `JSContext*` value and reveals the 75 * real context. 76 * 77 * Interpreter Loops 78 * ----------------- 79 * 80 * There are two interpreter loops: the JSOp interpreter and the 81 * CacheIR interpreter. These closely correspond to (i) the blinterp 82 * body that is generated at startup for the native baseline 83 * interpreter, and (ii) an interpreter version of the code generated 84 * by the `BaselineCacheIRCompiler`, respectively. 85 * 86 * Execution begins in the JSOp interpreter, and for any op(*) that 87 * has an IC site (`JOF_IC` flag), we invoke the IC interpreter. The 88 * IC interpreter runs a loop that traverses the IC stub chain, either 89 * reaching CacheIR bytecode and executing it in a virtual machine, or 90 * reaching the fallback stub and executing that (likely pushing an 91 * exit frame and calling into the rest of SpiderMonkey). 92 * 93 * (*) As an optimization, some opcodes that would have IC sites in 94 * native baseline skip their IC chains and run generic code instead 95 * in PBL. See "Hybrid IC mode" below for more details. 96 * 97 * IC Interpreter State 98 * -------------------- 99 * 100 * While the JS opcode interpreter's abstract machine model and its 101 * mapping of those abstract semantics to real machine state are 102 * well-defined (by the other baseline tiers), the IC interpreter's 103 * mapping is less so. When executing in native baseline tiers, 104 * CacheIR is compiled to machine code that undergoes register 105 * allocation and several optimizations (e.g., handling constants 106 * specially, and eliding type-checks on values when we know their 107 * actual types). No other interpreter for CacheIR exists, so we get 108 * to define how we map the semantics to interpreter state. 109 * 110 * We choose to keep an array of uint64_t values as "virtual 111 * registers", each corresponding to a particular OperandId, and we 112 * store the same values that would exist in the native machine 113 * registers. In other words, we do not do any sort of register 114 * allocation or reclamation of storage slots, because we don't have 115 * any lookahead in the interpreter. We rely on the typesafe writer 116 * API, with newtype'd wrappers for different kinds of values 117 * (`ValOperandId`, `ObjOperandId`, `Int32OperandId`, etc.), producing 118 * typesafe CacheIR bytecode, in order to properly store and interpret 119 * unboxed values in the virtual registers. 120 * 121 * There are several subtle details usually handled by register 122 * allocation in the CacheIR compilers that need to be handled here 123 * too, mainly around input arguments and restoring state when 124 * chaining to the next IC stub. IC callsites place inputs into the 125 * first N OperandId registers directly, corresponding to what the 126 * CacheIR expects. There are some CacheIR opcodes that mutate their 127 * argument in-place (e.g., guarding that a Value is an Object strips 128 * the tag-bits from the Value and turns it into a raw pointer), so we 129 * cannot rely on these remaining unmodified if we need to invoke the 130 * next IC in the chain; instead, we save and restore the first N 131 * values in the chain-walking loop (according to the arity of the IC 132 * kind). 133 * 134 * Optimizations 135 * ------------ 136 * 137 * There are several implementation details that are critical for 138 * performance, and thus should be carefully maintained or verified 139 * with any changes: 140 * 141 * - Caching values in locals: in order to be competitive with "native 142 * baseline interpreter", which has the advantage of using machine 143 * registers for commonly-accessed values such as the 144 * top-of-operand-stack and the JS opcode PC, we are careful to 145 * ensure that the C++ compiler can keep these values in registers 146 * in PBL as well. One might naively store `pc`, `sp`, `fp`, and the 147 * like in a context struct (of "virtual CPU registers") that is 148 * passed to e.g. the IC interpreter. This would be a mistake: if 149 * the values exist in memory, the compiler cannot "lift" them to 150 * locals that can live in registers, and so every push and pop (for 151 * example) performs a store. This overhead is significant, 152 * especially when executing more "lightweight" opcodes. 153 * 154 * We make use of an important property -- the balanced-stack 155 * invariant -- so that we can pass SP *into* calls but not take an 156 * updated SP *from* them. When invoking an IC, we expect that when 157 * it returns, SP will be at the same location (one could think of 158 * SP as a "callee-saved register", though it's not usually 159 * described that way). Thus, we can avoid a dependency on a value 160 * that would have to be passed back through memory. 161 * 162 * - Hybrid IC mode: the fact that we *interpret* ICs now means that 163 * they are more expensive to invoke. Whereas a small IC that guards 164 * two int32 arguments, performs an int32 add, and returns might 165 * have been a handful of instructions before, and the call/ret pair 166 * would have been very fast (and easy to predict) instructions at 167 * the machine level, the setup and context transition and the 168 * CacheIR opcode dispatch overhead would likely be much slower than 169 * a generic "if both int32, add" fastpath in the interpreter case 170 * for `JSOp::Add`. 171 * 172 * We thus take a hybrid approach, and include these static 173 * fastpaths for what would have been ICs in "native 174 * baseline". These are enabled by the `kHybridICs` global and may 175 * be removed in the future (transitioning back to ICs) if/when we 176 * can reduce the cost of interpreted ICs further. 177 * 178 * Right now, calls and property accesses use ICs: 179 * 180 * - Calls can often be special-cased with CacheIR when intrinsics 181 * are invoked. For example, a call to `String.length` can turn 182 * into a CacheIR opcode that directly reads a `JSString`'s length 183 * field. 184 * - Property accesses are so frequent, and the shape-lookup path 185 * is slow enough, that it still makes sense to guard on shape 186 * and quickly return a particular slot. 187 * 188 * - Static branch prediction for opcode dispatch: we adopt an 189 * interpreter optimization we call "static branch prediction": when 190 * one opcode is often followed by another, it is often more 191 * efficient to check for those specific cases first and branch 192 * directly to the case for the following opcode, doing the full 193 * switch otherwise. This is especially true when the indirect 194 * branches used by `switch` statements or computed gotos are 195 * expensive on a given platform, such as Wasm. 196 * 197 * - Inlining: on some platforms, calls are expensive, and we want to 198 * avoid them whenever possible. We have found that it is quite 199 * important for performance to inline the IC interpreter into the 200 * JSOp interpreter at IC sites: both functions are quite large, 201 * with significant local state, and so otherwise, each IC call 202 * involves a lot of "context switching" as the code generated by 203 * the C++ compiler saves registers and constructs a new native 204 * frame. This is certainly a code-size tradeoff, but we have 205 * optimized for speed here. 206 * 207 * - Amortized stack checks: a naive interpreter implementation would 208 * check for auxiliary stack overflow on every push. We instead do 209 * this once when we enter a new JS function frame, using the 210 * script's precomputed "maximum stack depth" value. We keep a small 211 * stack margin always available, so that we have enough space to 212 * push an exit frame and invoke the "over-recursed" helper (which 213 * throws an exception) when we would otherwise overflow. The stack 214 * checks take this margin into account, failing if there would be 215 * less than the margin available at any point in the called 216 * function. 217 * 218 * - Fastpaths for calls and returns: we are able to push and pop JS 219 * stack frames while remaining in one native (C++ interpreter 220 * function) frame, just as the C++ interpreter does. This means 221 * that there is a one-to-many mapping from native stack frame to JS 222 * stack frame. This does create some complications at points that 223 * pop frames: we might remain in the same C++ frame, or we might 224 * return at the C++ level. We handle this in a unified way for 225 * returns and exception unwinding as described below. 226 * 227 * Unwinding 228 * --------- 229 * 230 * Because one C++ interpreter frame can correspond to multiple JS 231 * frames, we need to disambiguate the two cases whenever leaving a 232 * frame: we may need to return, or we may stay in the current 233 * function and dispatch the next opcode at the caller's next PC. 234 * 235 * Exception unwinding compilcates this further. PBL uses the same 236 * exception-handling code that native baseline does, and this code 237 * computes a `ResumeFromException` struct that tells us what our new 238 * stack pointer and frame pointer must be. These values could be 239 * arbitrarily far "up" the stack in the current activation. It thus 240 * wouldn't be sufficient to count how many JS frames we have, and 241 * return at the C++ level when this reaches zero: we need to "unwind" 242 * the C++ frames until we reach the appropriate JS frame. 243 * 244 * To solve both issues, we remember the "entry frame" when we enter a 245 * new invocation of `PortableBaselineInterpret()`, and when returning 246 * or unwinding, if the new frame is *above* this entry frame, we 247 * return. We have an enum `PBIResult` that can encode, when 248 * unwinding, *which* kind of unwinding we are doing, because when we 249 * do eventually reach the C++ frame that owns the newly active JS 250 * frame, we may resume into a different action depending on this 251 * information. 252 * 253 * Completeness 254 * ------------ 255 * 256 * Whenever a new JSOp is added, the opcode needs to be added to 257 * PBL. The compiler should enforce this: if no case is implemented 258 * for an opcode, then the label in the computed-goto table will be 259 * missing and PBL will not compile. 260 * 261 * In contrast, CacheIR opcodes need not be implemented right away, 262 * and in fact right now most of the less-common ones are not 263 * implemented by PBL. If the IC interpreter hits an unimplemented 264 * opcode, it acts as if a guard had failed, and transfers to the next 265 * stub in the chain. Every chain ends with a fallback stub that can 266 * handle every case (it does not execute CacheIR at all, but instead 267 * calls into the runtime), so this will always give the correct 268 * result, albeit more slowly. Implementing the remainder of the 269 * CacheIR opcodes, and new ones as they are added, is thus purely a 270 * performance concern. 271 * 272 * PBL currently does not implement async resume into a suspended 273 * generator. There is no particular reason that this cannot be 274 * implemented; it just has not been done yet. Such an action will 275 * currently call back into the C++ interpreter to run the resumed 276 * generator body. Execution up to the first yield-point can still 277 * occur in PBL, and PBL can successfully save the suspended state. 278 */ 279 280 #include "jspubtd.h" 281 282 #include "jit/BaselineFrame.h" 283 #include "jit/BaselineIC.h" 284 #include "jit/JitContext.h" 285 #include "jit/JitRuntime.h" 286 #include "jit/JitScript.h" 287 #include "vm/Interpreter.h" 288 #include "vm/Stack.h" 289 290 namespace js { 291 namespace pbl { 292 293 // Trampoline invoked by EnterJit that sets up PBL state and invokes 294 // the main interpreter loop. 295 bool PortableBaselineTrampoline(JSContext* cx, size_t argc, Value* argv, 296 size_t numFormals, jit::CalleeToken calleeToken, 297 JSObject* envChain, Value* result); 298 299 // Predicate: are all conditions satisfied to allow execution within 300 // PBL? This depends only on properties of the function to be invoked, 301 // and not on other runtime state, like the current stack depth, so if 302 // it returns `true` once, it can be assumed to always return `true` 303 // for that function. See `PortableBaselineInterpreterStackCheck` 304 // below for a complimentary check that does not have this property. 305 jit::MethodStatus CanEnterPortableBaselineInterpreter(JSContext* cx, 306 RunState& state); 307 308 // A check for availbale stack space on the PBL auxiliary stack that 309 // is invoked before the main trampoline. This is required for entry 310 // into PBL and should be checked before invoking the trampoline 311 // above. Unlike `CanEnterPortableBaselineInterpreter`, the result of 312 // this check cannot be cached: it must be checked on each potential 313 // entry. 314 bool PortablebaselineInterpreterStackCheck(JSContext* cx, RunState& state, 315 size_t numActualArgs); 316 317 struct State; 318 struct Stack; 319 struct StackVal; 320 struct StackValNative; 321 struct ICRegs; 322 class VMFrameManager; 323 324 enum class PBIResult { 325 Ok, 326 Error, 327 Unwind, 328 UnwindError, 329 UnwindRet, 330 }; 331 332 template <bool IsRestart, bool HybridICs> 333 PBIResult PortableBaselineInterpret( 334 JSContext* cx_, State& state, Stack& stack, StackVal* sp, 335 JSObject* envChain, Value* ret, jsbytecode* pc, ImmutableScriptData* isd, 336 jsbytecode* restartEntryPC, jit::BaselineFrame* restartFrame, 337 StackVal* restartEntryFrame, PBIResult restartCode); 338 339 uint8_t* GetPortableFallbackStub(jit::BaselineICFallbackKind kind); 340 uint8_t* GetICInterpreter(); 341 342 } /* namespace pbl */ 343 } /* namespace js */ 344 345 #endif /* vm_PortableBaselineInterpret_h */