=======
Ranking
=======

Before results appear in the UrlbarView, they are fetched from providers.

Each `UrlbarProvider <https://firefox-source-docs.mozilla.org/browser/urlbar/overview.html#urlbarprovider>`_
implements its own internal ranking and returns sorted results.

Externally, all the results are ranked by the `UrlbarMuxer <https://searchfox.org/mozilla-central/source/browser/components/urlbar/UrlbarMuxerStandard.sys.mjs>`_
according to a hardcoded list of groups and sub-groups.

.. NOTE:: Preferences can influence the group order, for example by putting
   Firefox Suggest before Search Suggestions.

The Places provider returns history and bookmark results using an internal
ranking algorithm called **Frecency**.

Frecency implementation
=======================

Frecency is a term derived from `frequency` and `recency`; its purpose is to
provide a ranking algorithm that gives importance both to how often a page is
accessed and to when it was last visited. Additionally, it accounts for the
type of each visit through a bonus system.

Frecency uses a double exponential decay.

Given a URL, we identify a reference time, t\ :sub:`ref`, that is either the
last visit date or, for pages without visits, the date the last bookmark was
added.

All the dates we use are expressed as the number of days from the epoch.

Since calculating a score from all of the visits every time a new visit is
added would be expensive, a sample of the last
``places.frecency.pages.numSampledVisits`` visits is used instead.

For each visit a *weight* is determined from one of four buckets:

1. Very High (``places.frecency.pages.veryHighWeight``): used when a page is
   in the High bucket and has interesting interaction data.
2. High (``places.frecency.pages.highWeight``): used for bookmarked or typed
   pages. Due to the current definition of typed, this includes URLs clicked
   on the Firefox UI (not the web), except searches.
3. Medium (``places.frecency.pages.mediumWeight``): used for normal browsing,
   like clicking links on pages or downloads.
4. Low (``places.frecency.pages.lowWeight``): used for non-typed redirect
   sources, visits across HTML frames (non top level), or reloads.

The score for a single visit is then calculated as:

.. math::
  score_{visit} = weight_{visit} * e^{-\lambda(t_{ref}-t_{visit})}

with the decay constant :math:`\lambda` corresponding to a 30-day half-life:

.. math::
  \lambda = \frac{\ln 2}{30}

The total score is the average sampled score, multiplied by the total number of
visits:

.. math::
  score_{total} = \frac{\sum score_{visit}}{samplecount} * visitcount

To take recency into account, this score is decayed daily using the half-life
defined above. To avoid having to multiply each score by an exponential every
day, we instead store in the database (``moz_places``) the number of days from
the epoch after which the daily-decayed score would become 1:

.. math::
  frecency = t_{ref} + \frac{\ln(score_{total})}{\lambda}

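The formulas above can be sketched as a small Python model. This is purely
illustrative, not the actual Places implementation, and the weight value used
below is a hypothetical stand-in for the ``places.frecency.pages.*`` prefs.

```python
import math

HALF_LIFE_DAYS = 30
LAMBDA = math.log(2) / HALF_LIFE_DAYS  # decay constant

def frecency(sampled_visits, visit_count, t_ref):
    """sampled_visits: list of (t_visit, weight), dates in days from the epoch;
    visit_count: total number of visits for the page;
    t_ref: reference day (last visit, or bookmark addition date)."""
    if not sampled_visits:
        return 0
    # Per-visit score: weight decayed by the visit's age relative to t_ref.
    scores = [w * math.exp(-LAMBDA * (t_ref - t)) for t, w in sampled_visits]
    # Average sampled score, scaled by the total visit count.
    total = sum(scores) / len(scores) * visit_count
    # Day (from the epoch) on which the daily-decayed score reaches 1.
    return t_ref + math.log(total) / LAMBDA

# A single visit on day 20000 with a hypothetical weight of 100 keeps a
# score above 1 for ln(100)/lambda, roughly 199 more days.
print(round(frecency([(20000, 100)], 1, 20000), 1))
```

Note how the stored value is a date rather than a score: comparing two pages by
frecency is then equivalent to comparing when their scores would decay to 1,
with no daily rewrite of every row.
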
Interactions
------------

Page view duration, keypress count, and scroll distance are tracked when a page
loads. All data is stored locally and never transmitted.

A page visit is classified as **interesting** if either criterion is met:

- Page view time ≥ ``places.frecency.pages.interactions.viewTimeSeconds``.
- Page view time ≥ ``places.frecency.pages.interactions.viewTimeIfManyKeypressesSeconds``
  AND keypress count ≥ ``places.frecency.pages.interactions.manyKeypresses``.

When a visit is classified as interesting, its scoring bucket is promoted. For
example, a visit initially categorized in the Medium bucket is promoted to the
High bucket if it has an interesting interaction.

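A compact way to see the classification and promotion rules is the following
Python sketch. The threshold values and visit-type labels are hypothetical
placeholders; the real thresholds come from the
``places.frecency.pages.interactions.*`` prefs.

```python
# Hypothetical thresholds standing in for the
# places.frecency.pages.interactions.* prefs.
VIEW_TIME_S = 30
VIEW_TIME_IF_MANY_KEYPRESSES_S = 10
MANY_KEYPRESSES = 20

def is_interesting(view_time_s, keypresses):
    """Either criterion marks the visit as interesting."""
    return (view_time_s >= VIEW_TIME_S
            or (view_time_s >= VIEW_TIME_IF_MANY_KEYPRESSES_S
                and keypresses >= MANY_KEYPRESSES))

def bucket(visit_type, interesting):
    """Map a visit to its weight bucket, promoting interesting visits."""
    if visit_type == "redirect":  # non-typed redirects, frames, reloads
        return "low"
    if visit_type == "bookmark_or_typed":
        return "very_high" if interesting else "high"
    return "high" if interesting else "medium"  # regular browsing
```
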
Interactions are stored separately from visits. To associate them, the query
checks whether a visit was recorded within ``places.frecency.pages.interactions.maxVisitGapSeconds``
of a recorded interaction. If a matching visit exists within this time window,
the interaction is paired with it. Otherwise, the interaction becomes a
**virtual visit**.

A single visit can generate multiple interactions. When a user switches away
from a tab, the current interaction ends if the user does not return within
``browser.places.interactions.breakupIfNoUpdatesForSeconds``. Returning to the
tab after this threshold creates a new interaction, even though the original
visit remains the same.

How frecency for a page is calculated
-------------------------------------

.. mermaid::
   :align: center
   :caption: Frecency calculation flow

   flowchart TD
       start([Page URL Input])
       subgraph gather ["Gather"]
           visits[Gather recent visits<br/>up to sample limit]
           interactions[Gather interesting<br/>interactions]
           pair[Match interactions<br/>with visits by timestamp]
           virtual{Unpaired<br/>interaction?}
           generate[Create virtual visit<br/>from interaction]
           combine[Combine visits<br/>with virtual visits]
           hasVisits{Any visits<br/>collected?}
           isBookmarked{Is URL<br/>bookmarked?}
           bookmarkFallback[High<br/>Bucket]
           noData[frecency = 0]
       end
       subgraph classify ["Classify"]
           visitType{Visit<br/>Type?}
           bookmark_typed[Bookmark/Typed]
           regular_click[Regular Link]
           redirect_auto[Redirect/Sponsored]
           bookmarkInteraction{Interesting<br/>interaction?}
           regularInteraction{Interesting<br/>interaction?}
           score1[Very High<br/>Bucket]
           score2[High<br/>Bucket]
           score3[Medium<br/>Bucket]
           score4[Low<br/>Bucket]
       end
       calcDecay[Apply exponential decay<br/>based on visit age]
       aggregate[Sum decayed scores and<br/>normalize by visit count]
       project[frecency = days until<br/>score decays to 1]
       %% Main
       start --> visits & interactions
       visits --> pair
       interactions --> pair
       pair --> virtual
       virtual -->|Yes| generate
       virtual -->|No| combine
       generate --> combine
       combine --> hasVisits
       hasVisits -->|Yes| visitType
       hasVisits -->|No| isBookmarked
       isBookmarked -->|Yes| bookmarkFallback
       isBookmarked -->|No| noData
       bookmarkFallback --> calcDecay
       %% Classification
       visitType -->|Bookmark/Typed| bookmark_typed
       visitType -->|Regular| regular_click
       visitType -->|Redirect| redirect_auto
       bookmark_typed --> bookmarkInteraction
       regular_click --> regularInteraction
       redirect_auto --> score4
       bookmarkInteraction -->|Yes| score1
       bookmarkInteraction -->|No| score2
       regularInteraction -->|Yes| score2
       regularInteraction -->|No| score3
       %% Decay
       score1 & score2 & score3 & score4 --> calcDecay
       calcDecay --> aggregate
       %% Final
       aggregate --> project
       %% Styling
       style score1 fill:#4caf50,stroke:#2e7d32,stroke-width:3px,color:#fff
       style score2 fill:#8bc34a,stroke:#558b2f,stroke-width:2px
       style score3 fill:#ffc107,stroke:#f57f17,stroke-width:2px
       style bookmarkFallback fill:#8bc34a,stroke:#558b2f,stroke-width:2px
       style score4 fill:#ff9800,stroke:#e65100,stroke-width:2px
       style project fill:#2196f3,stroke:#0d47a1,stroke-width:3px,color:#fff
       style noData fill:#2196f3,stroke:#0d47a1,stroke-width:3px,color:#fff

1. **Gather visits and interactions:** For a given URL, collect recent visits
   (up to the sample limit) and any interesting interactions.
2. **Match and create virtual visits:** Pair each interesting interaction with
   its corresponding visit by timestamp. If an interaction has no paired visit,
   create a "virtual visit" from that interaction. Combine all actual visits
   with these virtual visits for scoring.
3. **Check visits exist:** If no virtual or non-virtual visits were collected,
   check whether the URL is bookmarked. If bookmarked, assign it to the High
   bucket, otherwise set frecency to 0.
4. **Classify visits into buckets:** Categorize each visit based on its type
   and whether it has an interesting interaction.
5. **Apply exponential decay:** Reduce each visit's score based on its age,
   using the time difference between the most recent visit in the sample and
   the visit's timestamp.
6. **Aggregate scores:** Sum all decayed scores, divide by the sample count to
   get an average decayed score, and multiply by the total visit count.
7. **Project to frecency value:** Convert the aggregated score into a timestamp
   representing the number of days from the Unix epoch until the score would
   decay to 1. This is the stored frecency value.

When frecency for a page is calculated
--------------------------------------

Operations that may influence the frecency score are:

* Adding visits
* Removing visits
* Adding bookmarks
* Removing bookmarks
* Changing the URL of a bookmark
* Recording an interesting interaction

Frecency is recalculated:

* Immediately, when a new visit is added. The user expectation here is that the
  page appears in search results right after being visited. This also holds for
  any History API that allows adding visits.
* In the background during idle time, in any other case. In most cases a
  temporarily stale value is not a problem; the main concern would be privacy
  when removing the history of a page, but removing the whole history will
  either completely remove the page or, if it's bookmarked, the page will still
  be relevant. In this case, when a change influencing frecency happens, the
  ``recalc_frecency`` database field for the page is set to ``1``.

Recalculation is done by the `PlacesFrecencyRecalculator <https://searchfox.org/mozilla-central/source/toolkit/components/places/PlacesFrecencyRecalculator.sys.mjs>`_ module.
The Recalculator is notified when the ``PlacesUtils.history.shouldStartFrecencyRecalculation``
value changes from false to true, meaning there are values to recalculate.
A DeferredTask is armed that looks for a user-idle opportunity within the next
5 minutes, and otherwise runs when that time elapses.
Once all the outdated values have been recalculated,
``PlacesUtils.history.shouldStartFrecencyRecalculation`` is set back to false
until the next operation invalidating a frecency.
The recalculation task is also armed on the ``idle-daily`` notification.

When the task is executed, it recalculates the frecency of a chunk of pages. If
there are more pages left to recalculate, the task is re-armed. After the
frecency of a page is recalculated, its ``recalc_frecency`` field is set back
to ``0``.

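The chunked recalculation loop can be sketched roughly as follows. This is an
illustrative Python model (the real module is JavaScript operating on the
database); the page representation, function names, and chunk size are all
hypothetical.

```python
def recalculate_chunk(pages, compute_frecency, chunk_size=50):
    """Recalculate frecency for up to chunk_size flagged pages.
    Returns True if flagged pages remain, i.e. the task should be re-armed."""
    todo = [p for p in pages if p["recalc_frecency"] == 1]
    for page in todo[:chunk_size]:
        page["frecency"] = compute_frecency(page)
        page["recalc_frecency"] = 0  # mark the page as up to date
    return len(todo) > chunk_size

# Three pages flagged for recalculation, processed two at a time.
pages = [{"url": f"https://example.com/{i}", "frecency": 0, "recalc_frecency": 1}
         for i in range(3)]
while recalculate_chunk(pages, lambda page: 100, chunk_size=2):
    pass  # re-arm until no flagged pages remain
```
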
Adaptive Input History
======================

Input History (also known as Adaptive History) is a feature that allows finding
URLs that the user previously picked. To do so, it associates search strings
with picked URLs.

Adaptive results are usually presented before frecency-derived results, making
them appear as having an infinite frecency.

When the user types a given string and picks a result from the address bar, that
relation is stored and increases a ``use_count`` field for the given string.
The ``use_count`` field asymptotically approaches a maximum of ``10`` (the
update is done as ``use_count * .9 + 1``).

On querying, all the search strings that start with the input string are
matched, and a rank is calculated for each page as
``ROUND(MAX(use_count) * (1 + (input = :search_string)), 1)``, so that results
perfectly matching the search string appear at the top. Results with the same
rank are additionally sorted by descending frecency.

On the daily idle, when frecency is decayed, input history is decayed as well;
in particular, the ``use_count`` field is multiplied by a decay rate of
``0.975``. After decaying, any entry with ``use_count < 0.975^90 (≈ 0.1)`` is
removed; thus entries are removed if unused for 90 days.

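The update, ranking, and decay rules above can be sketched in Python; the real
logic lives in SQL and JavaScript, and the function names here are made up for
illustration.

```python
def bump_use_count(use_count):
    # Fixed point at 10: bump_use_count(10) == 10, so repeated picks
    # asymptotically approach a use_count of 10.
    return use_count * 0.9 + 1

def rank(max_use_count, typed_input, search_string):
    # Mirrors ROUND(MAX(use_count) * (1 + (input = :search_string)), 1):
    # a perfect match doubles the rank.
    return round(max_use_count * (1 + (typed_input == search_string)), 1)

def daily_decay(use_count, rate=0.975):
    # Applied once per day; entries dropping below 0.975**90 (about 0.1)
    # are removed, i.e. after roughly 90 unused days.
    return use_count * rate
```
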
Why legacy frecency scoring was replaced
========================================

:doc:`ranking-legacy` had the following issues:

1. The score depended on the time at which it was calculated, making the user
   experience less stable. For example, removing a bookmark (which affects the
   score negatively) would recalculate the score to a value higher than that of
   another untouched bookmarked URL, just because it was recalculated later.
2. All the scores had to be decayed every day to take recency into account,
   which is an expensive operation.
3. It had far too many coefficients, making it impractical to study and
   optimize them.
4. Changing the coefficients didn't recalculate the existing scores, making
   experimentation more complicated.