tor

The Tor anonymity network
git clone https://git.dasho.dev/tor.git

commit 3ae87c3c7f1fb79744a343c0033afa24520a56d6
parent 65013a6924e5b97987819d93b4b8dc1acbee3c1e
Author: Nick Mathewson <nickm@torproject.org>
Date:   Wed,  6 Nov 2019 12:50:57 -0500

Turn the "dataflow" document into a doxygen page.

Diffstat:
D doc/HACKING/design/02-dataflow.md | 236 -------------------------------------------------------------------------------
A src/core/or/dataflow.dox | 238 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M src/mainpage.dox | 16 +++++++++++++++-
3 files changed, 253 insertions(+), 237 deletions(-)

diff --git a/doc/HACKING/design/02-dataflow.md b/doc/HACKING/design/02-dataflow.md
@@ -1,236 +0,0 @@
-
-## Data flow in the Tor process ##
-
-We read bytes from the network, we write bytes to the network. For the
-most part, the bytes we write correspond roughly to bytes we have read,
-with bits of cryptography added in.
-
-The rest is a matter of details.
-
-![Diagram of main data flows in Tor](./diagrams/02/02-dataflow.png "Diagram of main data flows in Tor")
-
-### Connections and buffers: reading, writing, and interpreting. ###
-
-At a low level, Tor's networking code is based on "connections". Each
-connection represents an object that can send or receive network-like
-events. For the most part, each connection has a single underlying TCP
-stream (I'll discuss counterexamples below).
-
-A connection that behaves like a TCP stream has an input buffer and an
-output buffer. Incoming data is
-written into the input buffer ("inbuf"); data to be written to the
-network is queued on an output buffer ("outbuf").
-
-Buffers are implemented in buffers.c. Each of these buffers is
-implemented as a linked queue of memory extents, in the style of classic
-BSD mbufs, or Linux skbufs.
-
-A connection's reading and writing can be enabled or disabled. Under
-the hood, this functionality is implemented using libevent events: one
-for reading, one for writing. These events are turned on/off in
-main.c, in the functions connection_{start,stop}_{reading,writing}.
-
-When a read or write event is turned on, the main libevent loop polls
-the kernel, asking which sockets are ready to read or write. (This
-polling happens in the event_base_loop() call in run_main_loop_once()
-in main.c.) When libevent finds a socket that's ready to read or write,
-it invokes conn_{read,write}_callback(), also in main.c
-
-These callback functions delegate to connection_handle_read() and
-connection_handle_write() in connection.c, which read or write on the
-network as appropriate, possibly delegating to openssl.
-
-After data is read or written, or other event occurs, these
-connection_handle_read_write() functions call logic functions whose job is
-to respond to the information. Some examples included:
-
- * connection_flushed_some() -- called after a connection writes any
- amount of data from its outbuf.
- * connection_finished_flushing() -- called when a connection has
- emptied its outbuf.
- * connection_finished_connecting() -- called when an in-process connection
- finishes making a remote connection.
- * connection_reached_eof() -- called after receiving a FIN from the
- remote server.
- * connection_process_inbuf() -- called when more data arrives on
- the inbuf.
-
-These functions then call into specific implementations depending on
-the type of the connection. For example, if the connection is an
-edge_connection_t, connection_reached_eof() will call
-connection_edge_reached_eof().
-
-> **Note:** "Also there are bufferevents!" We have vestigial
-> code for an alternative low-level networking
-> implementation, based on Libevent's evbuffer and bufferevent
-> code. These two object types take on (most of) the roles of
-> buffers and connections respectively. It isn't working in today's
-> Tor, due to code rot and possible lingering libevent bugs. More
-> work is needed; it would be good to get this working efficiently
-> again, to have IOCP support on Windows.
-
-
-#### Controlling connections ####
-
-A connection can have reading or writing enabled or disabled for a
-wide variety of reasons, including:
-
- * Writing is disabled when there is no more data to write
- * For some connection types, reading is disabled when the inbuf is
- too full.
- * Reading/writing is temporarily disabled on connections that have
- recently read/written enough data up to their bandwidth
- * Reading is disabled on connections when reading more data from them
- would require that data to be buffered somewhere else that is
- already full.
-
-Currently, these conditions are checked in a diffuse set of
-increasingly complex conditional expressions. In the future, it could
-be helpful to transition to a unified model for handling temporary
-read/write suspensions.
-
-#### Kinds of connections ####
-
-Today Tor has the following connection and pseudoconnection types.
-For the most part, each type of channel has an associated C module
-that implements its underlying logic.
-
-**Edge connections** receive data from and deliver data to points
-outside the onion routing network. See `connection_edge.c`. They fall into two types:
-
-**Entry connections** are a type of edge connection. They receive data
-from the user running a Tor client, and deliver data to that user.
-They are used to implement SOCKSPort, TransPort, NATDPort, and so on.
-Sometimes they are called "AP" connections for historical reasons (it
-used to stand for "Application Proxy").
-
-**Exit connections** are a type of edge connection. They exist at an
-exit node, and transmit traffic to and from the network.
-
-(Entry connections and exit connections are also used as placeholders
-when performing a remote DNS request; they are not decoupled from the
-notion of "stream" in the Tor protocol. This is implemented partially
-in `connection_edge.c`, and partially in `dnsserv.c` and `dns.c`.)
-
-**OR connections** send and receive Tor cells over TLS, using some
-version of the Tor link protocol. Their implementation is spread
-across `connection_or.c`, with a bit of logic in `command.c`,
-`relay.c`, and `channeltls.c`.
-
-**Extended OR connections** are a type of OR connection for use on
-bridges using pluggable transports, so that the PT can tell the bridge
-some information about the incoming connection before passing on its
-data. They are implemented in `ext_orport.c`.
-
-**Directory connections** are server-side or client-side connections
-that implement Tor's HTTP-based directory protocol. These are
-instantiated using a socket when Tor is making an unencrypted HTTP
-connection. When Tor is tunneling a directory request over a Tor
-circuit, directory connections are implemented using a linked
-connection pair (see below). Directory connections are implemented in
-`directory.c`; some of the server-side logic is implemented in
-`dirserver.c`.
-
-**Controller connections** are local connections to a controller
-process implementing the controller protocol from
-control-spec.txt. These are in `control.c`.
-
-**Listener connections** are not stream oriented! Rather, they wrap a
-listening socket in order to detect new incoming connections. They
-bypass most of stream logic. They don't have associated buffers.
-They are implemented in `connection.c`.
-
-![structure hierarchy for connection types](./diagrams/02/02-connection-types.png "structure hierarchy for connection types")
-
->**Note**: "History Time!" You might occasionally find reference to a couple types of connections
-> which no longer exist in modern Tor. A *CPUWorker connection*
->connected the main Tor process to a thread or process used for
->computation. (Nowadays we use in-process communication.) Even more
->anciently, a *DNSWorker connection* connected the main tor process to
->a separate thread or process used for running `gethostbyname()` or
->`getaddrinfo()`. (Nowadays we use Libevent's evdns facility to
->perform DNS requests asynchronously.)
-
-#### Linked connections ####
-
-Sometimes two channels are joined together, such that data which the
-Tor process sends on one should immediately be received by the same
-Tor process on the other. (For example, when Tor makes a tunneled
-directory connection, this is implemented on the client side as a
-directory connection whose output goes, not to the network, but to a
-local entry connection. And when a directory receives a tunnelled
-directory connection, this is implemented as an exit connection whose
-output goes, not to the network, but to a local directory connection.)
-
-The earliest versions of Tor to support linked connections used
-socketpairs for the purpose. But using socketpairs forced us to copy
-data through kernelspace, and wasted limited file descriptors. So
-instead, a pair of connections can be linked in-process. Each linked
-connection has a pointer to the other, such that data written on one
-is immediately readable on the other, and vice versa.
-
-### From connections to channels ###
-
-There's an abstraction layer above OR connections (the ones that
-handle cells) and below cells called **Channels**. A channel's
-purpose is to transmit authenticated cells from one Tor instance
-(relay or client) to another.
-
-Currently, only one implementation exists: Channel_tls, which sends
-and receiveds cells over a TLS-based OR connection.
-
-Cells are sent on a channel using
-`channel_write_{,packed_,var_}cell()`. Incoming cells arrive on a
-channel from its backend using `channel_queue*_cell()`, and are
-immediately processed using `channel_process_cells()`.
-
-Some cell types are handled below the channel layer, such as those
-that affect handshaking only. And some others are passed up to the
-generic cross-channel code in `command.c`: cells like `DESTROY` and
-`CREATED` are all trivial to handle. But relay cells
-require special handling...
-
-### From channels through circuits ###
-
-When a relay cell arrives on an existing circuit, it is handled in
-`circuit_receive_relay_cell()` -- one of the innermost functions in
-Tor. This function encrypts or decrypts the relay cell as
-appropriate, and decides whether the cell is intended for the current
-hop of the circuit.
-
-If the cell *is* intended for the current hop, we pass it to
-`connection_edge_process_relay_cell()` in `relay.c`, which acts on it
-based on its relay command, and (possibly) queues its data on an
-`edge_connection_t`.
-
-If the cell *is not* intended for the current hop, we queue it for the
-next channel in sequence with `append cell_to_circuit_queue()`. This
-places the cell on a per-circuit queue for cells headed out on that
-particular channel.
-
-### Sending cells on circuits: the complicated bit. ###
-
-Relay cells are queued onto circuits from one of two (main) sources:
-reading data from edge connections, and receiving a cell to be relayed
-on a circuit. Both of these sources place their cells on cell queue:
-each circuit has one cell queue for each direction that it travels.
-
-A naive implementation would skip using cell queues, and instead write
-each outgoing relay cell. (Tor did this in its earlier versions.)
-But such an approach tends to give poor performance, because it allows
-high-volume circuits to clog channels, and it forces the Tor server to
-send data queued on a circuit even after that circuit has been closed.
-
-So by using queues on each circuit, we can add cells to each channel
-on a just-in-time basis, choosing the cell at each moment based on
-a performance-aware algorithm.
-
-This logic is implemented in two main modules: `scheduler.c` and
-`circuitmux*.c`. The scheduler code is responsible for determining
-globally, across all channels that could write cells, which one should
-next receive queued cells. The circuitmux code determines, for all
-of the circuits with queued cells for a channel, which one should
-queue the next cell.
-
-(This logic applies to outgoing relay cells only; incoming relay cells
-are processed as they arrive.)
diff --git a/src/core/or/dataflow.dox b/src/core/or/dataflow.dox
@@ -0,0 +1,238 @@
+/**
+@tableofcontents
+
+@page dataflow Data flow in the Tor process
+
+We read bytes from the network, we write bytes to the network. For the
+most part, the bytes we write correspond roughly to bytes we have read,
+with bits of cryptography added in.
+
+The rest is a matter of details.
+
+### Connections and buffers: reading, writing, and interpreting.
+
+At a low level, Tor's networking code is based on "connections". Each
+connection represents an object that can send or receive network-like
+events. For the most part, each connection has a single underlying TCP
+stream (I'll discuss counterexamples below).
+
+A connection that behaves like a TCP stream has an input buffer and an
+output buffer. Incoming data is
+written into the input buffer ("inbuf"); data to be written to the
+network is queued on an output buffer ("outbuf").
+
+Buffers are implemented in buffers.c. Each of these buffers is
+implemented as a linked queue of memory extents, in the style of classic
+BSD mbufs, or Linux skbufs.
+
+A connection's reading and writing can be enabled or disabled. Under
+the hood, this functionality is implemented using libevent events: one
+for reading, one for writing. These events are turned on/off in
+main.c, in the functions connection_{start,stop}_{reading,writing}.
+
+When a read or write event is turned on, the main libevent loop polls
+the kernel, asking which sockets are ready to read or write. (This
+polling happens in the event_base_loop() call in run_main_loop_once()
+in main.c.) When libevent finds a socket that's ready to read or write,
+it invokes conn_{read,write}_callback(), also in main.c
+
+These callback functions delegate to connection_handle_read() and
+connection_handle_write() in connection.c, which read or write on the
+network as appropriate, possibly delegating to openssl.
+
+After data is read or written, or other event occurs, these
+connection_handle_read_write() functions call logic functions whose job is
+to respond to the information. Some examples included:
+
+ * connection_flushed_some() -- called after a connection writes any
+ amount of data from its outbuf.
+ * connection_finished_flushing() -- called when a connection has
+ emptied its outbuf.
+ * connection_finished_connecting() -- called when an in-process connection
+ finishes making a remote connection.
+ * connection_reached_eof() -- called after receiving a FIN from the
+ remote server.
+ * connection_process_inbuf() -- called when more data arrives on
+ the inbuf.
+
+These functions then call into specific implementations depending on
+the type of the connection. For example, if the connection is an
+edge_connection_t, connection_reached_eof() will call
+connection_edge_reached_eof().
+
+> **Note:** "Also there are bufferevents!" We have vestigial
+> code for an alternative low-level networking
+> implementation, based on Libevent's evbuffer and bufferevent
+> code. These two object types take on (most of) the roles of
+> buffers and connections respectively. It isn't working in today's
+> Tor, due to code rot and possible lingering libevent bugs. More
+> work is needed; it would be good to get this working efficiently
+> again, to have IOCP support on Windows.
+
+
+#### Controlling connections ####
+
+A connection can have reading or writing enabled or disabled for a
+wide variety of reasons, including:
+
+ * Writing is disabled when there is no more data to write
+ * For some connection types, reading is disabled when the inbuf is
+ too full.
+ * Reading/writing is temporarily disabled on connections that have
+ recently read/written enough data up to their bandwidth
+ * Reading is disabled on connections when reading more data from them
+ would require that data to be buffered somewhere else that is
+ already full.
+
+Currently, these conditions are checked in a diffuse set of
+increasingly complex conditional expressions. In the future, it could
+be helpful to transition to a unified model for handling temporary
+read/write suspensions.
+
+#### Kinds of connections ####
+
+Today Tor has the following connection and pseudoconnection types.
+For the most part, each type of channel has an associated C module
+that implements its underlying logic.
+
+**Edge connections** receive data from and deliver data to points
+outside the onion routing network. See `connection_edge.c`. They fall into two types:
+
+**Entry connections** are a type of edge connection. They receive data
+from the user running a Tor client, and deliver data to that user.
+They are used to implement SOCKSPort, TransPort, NATDPort, and so on.
+Sometimes they are called "AP" connections for historical reasons (it
+used to stand for "Application Proxy").
+
+**Exit connections** are a type of edge connection. They exist at an
+exit node, and transmit traffic to and from the network.
+
+(Entry connections and exit connections are also used as placeholders
+when performing a remote DNS request; they are not decoupled from the
+notion of "stream" in the Tor protocol. This is implemented partially
+in `connection_edge.c`, and partially in `dnsserv.c` and `dns.c`.)
+
+**OR connections** send and receive Tor cells over TLS, using some
+version of the Tor link protocol. Their implementation is spread
+across `connection_or.c`, with a bit of logic in `command.c`,
+`relay.c`, and `channeltls.c`.
+
+**Extended OR connections** are a type of OR connection for use on
+bridges using pluggable transports, so that the PT can tell the bridge
+some information about the incoming connection before passing on its
+data. They are implemented in `ext_orport.c`.
+
+**Directory connections** are server-side or client-side connections
+that implement Tor's HTTP-based directory protocol. These are
+instantiated using a socket when Tor is making an unencrypted HTTP
+connection. When Tor is tunneling a directory request over a Tor
+circuit, directory connections are implemented using a linked
+connection pair (see below). Directory connections are implemented in
+`directory.c`; some of the server-side logic is implemented in
+`dirserver.c`.
+
+**Controller connections** are local connections to a controller
+process implementing the controller protocol from
+control-spec.txt. These are in `control.c`.
+
+**Listener connections** are not stream oriented! Rather, they wrap a
+listening socket in order to detect new incoming connections. They
+bypass most of stream logic. They don't have associated buffers.
+They are implemented in `connection.c`.
+
+![structure hierarchy for connection types](./diagrams/02/02-connection-types.png "structure hierarchy for connection types")
+
+>**Note**: "History Time!" You might occasionally find reference to a couple types of connections
+> which no longer exist in modern Tor. A *CPUWorker connection*
+>connected the main Tor process to a thread or process used for
+>computation. (Nowadays we use in-process communication.) Even more
+>anciently, a *DNSWorker connection* connected the main tor process to
+>a separate thread or process used for running `gethostbyname()` or
+>`getaddrinfo()`. (Nowadays we use Libevent's evdns facility to
+>perform DNS requests asynchronously.)
+
+#### Linked connections ####
+
+Sometimes two channels are joined together, such that data which the
+Tor process sends on one should immediately be received by the same
+Tor process on the other. (For example, when Tor makes a tunneled
+directory connection, this is implemented on the client side as a
+directory connection whose output goes, not to the network, but to a
+local entry connection. And when a directory receives a tunnelled
+directory connection, this is implemented as an exit connection whose
+output goes, not to the network, but to a local directory connection.)
+
+The earliest versions of Tor to support linked connections used
+socketpairs for the purpose. But using socketpairs forced us to copy
+data through kernelspace, and wasted limited file descriptors. So
+instead, a pair of connections can be linked in-process. Each linked
+connection has a pointer to the other, such that data written on one
+is immediately readable on the other, and vice versa.
+
+### From connections to channels ###
+
+There's an abstraction layer above OR connections (the ones that
+handle cells) and below cells called **Channels**. A channel's
+purpose is to transmit authenticated cells from one Tor instance
+(relay or client) to another.
+
+Currently, only one implementation exists: Channel_tls, which sends
+and receiveds cells over a TLS-based OR connection.
+
+Cells are sent on a channel using
+`channel_write_{,packed_,var_}cell()`. Incoming cells arrive on a
+channel from its backend using `channel_queue*_cell()`, and are
+immediately processed using `channel_process_cells()`.
+
+Some cell types are handled below the channel layer, such as those
+that affect handshaking only. And some others are passed up to the
+generic cross-channel code in `command.c`: cells like `DESTROY` and
+`CREATED` are all trivial to handle. But relay cells
+require special handling...
+
+### From channels through circuits ###
+
+When a relay cell arrives on an existing circuit, it is handled in
+`circuit_receive_relay_cell()` -- one of the innermost functions in
+Tor. This function encrypts or decrypts the relay cell as
+appropriate, and decides whether the cell is intended for the current
+hop of the circuit.
+
+If the cell *is* intended for the current hop, we pass it to
+`connection_edge_process_relay_cell()` in `relay.c`, which acts on it
+based on its relay command, and (possibly) queues its data on an
+`edge_connection_t`.
+
+If the cell *is not* intended for the current hop, we queue it for the
+next channel in sequence with `append cell_to_circuit_queue()`. This
+places the cell on a per-circuit queue for cells headed out on that
+particular channel.
+
+### Sending cells on circuits: the complicated bit.
+
+Relay cells are queued onto circuits from one of two (main) sources:
+reading data from edge connections, and receiving a cell to be relayed
+on a circuit. Both of these sources place their cells on cell queue:
+each circuit has one cell queue for each direction that it travels.
+
+A naive implementation would skip using cell queues, and instead write
+each outgoing relay cell. (Tor did this in its earlier versions.)
+But such an approach tends to give poor performance, because it allows
+high-volume circuits to clog channels, and it forces the Tor server to
+send data queued on a circuit even after that circuit has been closed.
+
+So by using queues on each circuit, we can add cells to each channel
+on a just-in-time basis, choosing the cell at each moment based on
+a performance-aware algorithm.
+
+This logic is implemented in two main modules: `scheduler.c` and
+`circuitmux*.c`. The scheduler code is responsible for determining
+globally, across all channels that could write cells, which one should
+next receive queued cells. The circuitmux code determines, for all
+of the circuits with queued cells for a channel, which one should
+queue the next cell.
+
+(This logic applies to outgoing relay cells only; incoming relay cells
+are processed as they arrive.)
+
+**/
diff --git a/src/mainpage.dox b/src/mainpage.dox
@@ -1,7 +1,9 @@
 /**
 @mainpage Tor source reference
 
-@section intro Welcome to Tor
+@tableofcontents
+
+@section welcome Welcome to Tor
 
 This documentation describes the general structure of the Tor codebase, how
 it fits together, what functionality is available for extending Tor, and
@@ -24,6 +26,18 @@ development tools, see
 [doc/HACKING](https://gitweb.torproject.org/tor.git/tree/doc/HACKING)
 in the Tor repository.
 
+@section topics Topic-related documentation
+
+@subpage intro
+
+@subpage dataflow
+**/
+
+/**
+@page intro A high-level overview
+
+@tableofcontents
+
 @section highlevel The very high level
 
 Ultimately, Tor runs as an event-driven network daemon: it responds to