This feature is similar to ChanProvenance from ChanMgr, except that
we don't yet need to report it outside the crate. I'm going to use
it to distinguish newly created circuits from existing circuits in
the preemptive circuit builder.
This lets us say that the UsageMismatch cases in some parts of the
code reflect a programming error (RetryTime::Never), whereas in
other case it reflects another circuit request getting to the
circuit first (RetryTime::Immediate).
Previously we did not distinguish errors that came from pending
circuits from errors that came from the circuits we were
building. We also reported errors as coming from "Left" or "Right",
instead of a more reasonable description.
We were treating restrict_mut() failures as internal errors, and
using internal errors to represent them. But in fact, these
failures are entirely possible based on timing. Here's how it
happens:
* Two different circuit requests arrive at the same time, and both
notice a pending circuit that they could use.
* The pending circuit completes; both pending requests are notified.
* The first request calls restrict_mut(), and restricts the request
in such a way that the second couldn't use it.
* The second request calls restrict_mut(), and gets a failure.
Because of this issue, we treat these errors as transient failures
and just wait for another circuit.
Closes#427.
(This is not a breaking API change, since `AbstractSpec` is a
crate-private trait.)
Not all of these strictly need to be bumped to 0.2.0; many could go
to 0.1.1 instead. But since everything at the tor-rtcompat and
higher layers has had breaking API changes, it seems not so useful
to distinguish. (It seems unlikely that anybody at this stage is
depending on e.g. tor-protover but not arti-client.)
The older default seems (experimentally) to be ridiculously high.
Generally, if we can't build a circuit within a handful attempts,
that circuit has already timed out... unless there is a fast-failure
condition, in which case we're just hammering the network (or our
view of it.)
Found with `arti-testing` for #329.
Previously, if we had launch_parallelism > 1, and we were willing to
retry building a circuit max_retries times, then we'd launch up to
max_retries * launch_parallelism circuits before giving up. Ouch!
With this patch, we try to keep the total number of circuits
planned and attempted to the actual max_retries limit.
Part of #329; found with arti-testing.
The FirstHopId type now records an enum that stores whether the hop
is a guard or a fallback. This change addresses concerns about
remembering to check the type or source of an Id before passing it
down to the FallbackState or GuardSet.
Making this change required an API change, so that dirmgr can
report success/failure status without actually knowing whether it's
using a fallback or a guard.
The code here uses a new iterator type, since I couldn't find one of
these on crates.io. I tried writing the code without it, but it was
harder to follow and test.
We do this by creating a new FallbackSet type that includes status
information, and updating the GuardMgr APIs to record success and
failure about it when appropriate. We can use this to mark
FallbackDirs retriable (or not).
With this change, FallbackDir is now stored internally as a Guard in
the GuardMgr crate. That's fine: the FallbackDir type really only
matters for configuration.
If we're building a path with the guard manager involved, we now ask
the guard manager to pick our first hop no matter what. We only
pick from the fallback list ourselves if we're using the API with no
guard manager.
This causes some follow-on changes where we have to remember an
OwnedChanTarget object in a TorPath we've built, and where we gain
the ability to say we're building a path "from nothing extra at
all." Those are all internal to the crate, though.
Closes#220, by making sure that we use our guards to get a fresh
netdir (if we can) before falling back to any fallbacks, even if our
consensus is old.
Compilation should be fixed in the next commit.
The guard manager is responsible for handing out the first hops of
tor circuits, keeping track of their successes and failures, and
remembering their states. Given that, it makes sense to store this
information here. It is not yet used; I'll be fixing that in
upcoming commits.
Arguably, this information no longer belongs in the directory
manager: I've added a todo about moving it.
This commit will break compilation on its own in a couple of places;
subsequent commits will fix it up.
This is the logical place for it, I think: the GuardMgr's job is to
pick the first hop for a circuit depending on remembered status for
possible first hops. Making this change will let us streamline the
code that interacts with these objects.
The various background daemon tasks that `arti-client` used to spawn are
now handled inside their respective crates instead, with functions
provided to spawn them that return `TaskHandle`s.
This required introducing a new trait, `NetDirProvider`, which steals
some functionality from the `DirProvider` trait to enable `tor-circmgr`
to depend on it (`tor-circmgr` is a dependency of `tor-dirmgr`, so it
can't depend on `DirProvider` directly).
While we're at it, we also make some of the tasks wait for events from
the `NetDirProvider` instead of sleeping, slightly increasing
efficiency.
Some error types indicate that the guard has failed as a dircache.
We should treat these errors as signs to close the circuit, and to
mark the guard as having failed.
We already have the ability to get peer information from ChanMgr
errors, and therefore from any RetryErrors that contain ChanMgr
errors.
This commit adds optional peer information to tor-proto errors, and
a function to expose whatever peer information is available.
It'll soon more convenient to pass in FallbackDirs as a slice of
references, rather than just a slice of FallbackDirs: I'm going to
be changing how we handle these in tor-dirmgr.
If all guards are down and they won't be retriable for a while, try
waiting that long to get whichever guard _is_ retriable.
Additionally, if we are making multiple circuit plans in parallel,
only report our planning as having failed if we failed at making
_all_ the plans. Previously we treated any failure as fatal for the
other plans, which could lead to trouble in the case when guards
were all down or pending.
Part of #407.
Instead of requiring a `Box<dyn Isolation>`, it now takes either a
`Box<dyn Isolation>`, or an arbitrary `T` that implements
`Isolation`.
This API still allows the user to pass in a `Box<dyn Isolation>` if
that's what they have, but it doesn't require them to Box the
isolation on their own.
Part of #414.
This has the different syntax for builder field attributes than what I
originally proposed in my MR, and which therefore is in the pinned
branch.
My upstream MR for the field attributes feature was morged:
https://github.com/colin-kiegel/rust-derive-builder/issues/239
This has the humantime_serde::option module, which we have upstreamed
and are about to switch to.
The remaining dependency with version = "1" is going to be removed
in a moment.
I used
git-grep -P '\#\[serde\((?!default|deny_unknown)'
to find places where I needed to add additional attributes on the
builder method fields.
This is currently a bit duplicative, but when #371 is completely done,
the validated (non-builder) configs won't need to be Deserialize any
more.
This is part of #371 and #372.
We are going to want to specify custom attributes on fields of the
builder struct. This feature was missing from derive_builder.
This commitid is the current head of my MR branch
https://github.com/colin-kiegel/rust-derive-builder/pull/237https://github.com/ijackson/rust-derive-builder/tree/builder-field-attrs
Using the commitid prevents surprises if that branch is updated.
We will require this newer version of derive_builder. The version
will need to be bumped again later, assuming the upstream MR is merged
and upstream do a release containing the needed changes.
When I wrote this, I arranged to skip dumping the field `pending`.
This must have been because I thought that either
(a) PendingEntry couldn't `#[derive(Debug)]` (but it can)
and/or
(b) Some of the fields of PendingEntry ought not to be dumped because
they might contain (eg) packet data. But I think they don't: there's
just the spec, and the Result which is (basically) a Circ.
I tried preseving something closer to the original using educe, but
educe gets somehow tangled up with the generics, and the result fails
to compile. I haven't investigated this further.
Fixes#365
Inspection of the code and logs shows that:
* One of the plan futures' oneshots must be returning Cancelled
* This means that the corresponding sender must have been dropped
* The sender is owned by the task spawned by spawn_launch
Presumably that entire task gets dropped as part of executor shutdown,
or something.
The correct response in this situation is to declare that we are
shutting down, and stop trying to do stuff.
Unfortunately, despite trying quite hard by putting sleeps in various
strategic places, I have not been able to reproduce the problem. So I
can't be 100% sure that the new behaviour is correct.
But I am reasonably confident that this ought not to be able to occur
unless either 1. the task from spawn_launch is dropped, or 2. that
task somehow panics despite its attempts to trap panics and report
them as errors through the oneshot.
So this "burn it all down" action ought only to occur in actually
serious situations.
I observe that
3ff9b187ea
Handle panics from circuit construction.
changed the EK for PendingCanceled to EK::ReactorShuttingDown,
and there's From impl. I think, therefore, that it is right
to reuse this Error variant.
I don't quite understand why when take_action gets an actual error it
doesn't push it, but just logs it. But I am not changing that for
now.
Arguably the two instances of retry_error.push are a sign of an
inferior flow control pattern - maybe the loop body including the code
I am adding ought to be an IEFE returning
`Result<Option<circ>, crate::Error>`.
I wanted this while debugging something.
The ad-hoc impl Debug with f.debug_struct is getting repetitive
and I've already perpetrated one copy-paste mistake.
We should consider using something like the `educe` crate's Clone.
This lint is IMO inherently ill-conceived.
I have looked for the reasons why this might be thought to be a good
idea and there were basically two (and they are sort of contradictory):
I. "Calling ‘.clone()` on an Rc, Arc, or Weak can obscure the fact
that only the pointer is being cloned, not the underlying data."
This is the wording from
https://rust-lang.github.io/rust-clippy/v0.0.212/#clone_on_ref_ptr
It is a bit terse; we are left to infer why it is a bad idea to
obscure this fact. It seems to me that if it is bad to obscure some
fact, that must be because the fact is a hazard. But why would it be
a hazard to not copy the underlying data ?
In other languages, faliing to copy the underlying data is a serious
correctness hazard. There is a whose class of bugs where things were
not copied, and then mutated and/or reused in multiple places in ways
that were not what the programmer intended. In my experience, this is
a very common bug when writing Python and Javascript. I'm told it's
common in golang too.
But in Rust this bug is much much harder to write. The data inside an
Arc is immutable. To have this bug you'd have use interior mutability
- ie mess around with Mutex or RefCell. That provides a good barrier
to these kind of accidents.
II. "The reason for writing Rc::clone and Arc::clone [is] to make it
clear that only the pointer is being cloned, as opposed to the
underlying data. The former is always fast, while the latter can
be very expensive depending on what is being cloned."
This is the reasoning found here
https://github.com/rust-lang/rust-clippy/issues/2048
This is saying that *not* using Arc::clone is hazardous.
Specifically, that a deep clone is a performance hazard.
But for this argument, the lint is precisely backwards. It's linting
the "good" case and asking for it to be written in a more explicit
way; while the supposedly bad case can be written conveniently.
Also, many objects (in our codebase, and in all the libraries we use)
that are Clone are in fact simply handles. They contain Arc(s) (or
similar) and are cheap to clone. Indeed, that is the usual case.
It does not make sense to distinguish in the syntax we use to clone
such a handle, whether the handle is a transparent Arc, or an opaque
struct containing one or more other handles.
Forcing Arc::clone to be written as such makes for code churn when a
type is changed from Arc<Something> to Something: Clone, or vice
versa.
To implement this, we had to refactor the tor_circmgr api for
flushing state changes to disk, so that it checks if it has the lock,
and only then tries to store.
We handle them by reporting them to task that's waiting for the
circuit, then relaying the panic.
Doing so allows the waiting task to distinguish panics
(EK::Internal) from cases where the reactor dropped the task
entirely (EK::ReactorShuttingDown). And doing _that_ removes one
case of EK::Canceled, which helps us on our goals towards #348.
Closes#347.
From its old name, this error had implied that we were giving no
useful information when we were waiting on a pending cirucit request
that failed. In fact, this error would only happen if we dropped the
`mpsc::Sender` for a circuit attempt without reporting success or
failure.
These errors should almost never be seen by the user; we should instead
retry the circuit. But they _can_ be seen by the use if selecting a
guard takes too long, or too many attempts. (Therefore, they aren't true
"internal" errors.)
I suspect that we might not want to keep this TransientFailure kind, but
I'm not sure what else to do here for now.
In one of the two places, nightly no longer warns. In the other
place, it's fine for nightly to warn: I just fixed the code to take
a slice instead.
Partial revert of 856aca8791.
Resolves part of #310.
Provide an enum variant to contain the SpawnError and a From impl.
We use `#[from]` here because it doesn't really make sense to attach
any context, as it's not likely to be very relevant.
The type annotation may not be necessary for inference, but as a
comment it risks becoming false. So it should be uncommented, or
deleted.
Error types round here are not entirely trivial so uncomment it.
Prompted by clippy::needless_question_mark. Sometimes Ok(r?) is
needed to do automatic error conversion. I assume the lint checks for
that. Anyway, in these cases it's not needed.
tor-netdir needs to bump because tor-netdoc bumped, even though
there were no other changes in tor-netdir. Whoops.
tor-guardmgr needs to bump because it already published, with the
older tor-netdir.
This commit puts the native-tls crate behind a feature. The feature
is off-by-default in the tor-rtcompat crate, but can be enabled
either from arti or arti-client.
There is an included script that I used to test that tor-rtcompat
could build and run its tests with all subsets of its features.
Closes#300
The docs even say this is about stream.
As @nickm writes in
https://gitlab.torproject.org/tpo/core/arti/-/merge_requests/252#note_2771289
we generally call end-to-end connections that are tunneled over Tor
"Streams" to distinguish them from everything else in the Tor
protocols that could possibly be called a "Connection".
That seems to apply here too.
This is totally not just an exercise to get combined test coverage
for tor-circmgr over 90% because I needed something to do that
wouldn't distract anybody else. :)
Since it implements a "<=" type relationship, it should be called
"at_least_as_permissive_as()." Since it's a crate-private function,
the long name isn't too bad.
I found these versions empirically, by using the following process:
First, I used `cargo tree --depth 1 --kind all` to get a list of
every immediate dependency we had.
Then, I used `cargo upgrade --workspace package@version` to change
each dependency to the earliest version with which (in theory) the
current version is semver-compatible. IOW, if the current version
was 3.2.3, I picked "3". If the current version was 0.12.8, I
picked "0.12".
Then, I used `cargo +nightly upgrade -Z minimal-versions` to
downgrade Cargo.lock to the minimal listed version for each
dependency. (I had to override a few packages; see .gitlab-ci.yml
for details).
Finally, I repeatedly increased the version of each of our
dependencies until our code compiled and the tests passed. Here's
what I found that we need:
anyhow >= 1.0.5: Earlier versions break our hyper example.
async-broadcast >= 0.3.2: Earlier versions fail our tests.
async-compression 0.3.5: Earlier versions handled futures and tokio
differently.
async-trait >= 0.1.2: Earlier versions are too buggy to compile our
code.
clap 2.33.0: For Arg::default_value_os().
coarsetime >= 0.1.20: exposed as_ticks() function.
curve25519-dalek >= 3.2: For is_identity().
generic-array 0.14.3: Earlier versions don't implement
From<&[T; 32]>
httparse >= 1.2: Earlier versions didn't implement Error.
itertools at 0.10.1: For at_most_once.
rusqlite >= 0.26.3: for backward compatibility with older rustc.
serde 1.0.103: Older versions break our code.
serde_json >= 1.0.50: Since we need its Value type to implement Eq.
shellexpand >= 2.1: To avoid a broken dirs crate version.
tokio >= 1.4: For Handle::block_on().
tracing >= 0.1.18: Previously, tracing_core and tracing had separate
LevelFilter types.
typenum >= 1.12: Compatibility with rust-crypto crates
x25519-dalek >= 1.2.0: For was_contributory().
Closes#275.
We are going to get rid of the Arc. Happily there is an id which is
always constructed uniquely and preserved by clone.
(auto-deref lets us make the function take &Self instead of &Arc)
Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>
This is a method, so the resolution is automatic. It's not clear to
me why this was written out this way, given that extend_ntor is right
above.
Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>
This closes arti#256. It makes our behavior match Tor's more closely,
though it has a simpler implementation than Tor. I think that the extra
complexity in Tor's logic is because we used to record timeouts in
the histogram as well as in the success/failure log.
The implementations of `Add` / `Sub` (et al.) on `std::time::Instant`
can panic if the underlying OS structure can't represent the result
(like arti#266). Use Instant::checked_add and print a warning instead,
to prevent panicking.
Also, we now add instead of subtracting; I suspect it's reasonable that
you might not be able to go backward past the first `Instant` created on
some platforms, but going *forward* should probably work?
IIUC, these anticipatd a need to store min_exit_circs_per_port in
CircMgr. But the current design, where it goes into preemptive.rs and
thence to usage, seems to work fine.
This required re-centralizing the configuration object for preemptive
circuits, since previously the settings from it were a bit spread out
over the crate.
And now the complexity begins: when the user changes the path_rules,
they not only want new circuits to obey those rules: they want
_all new requests_ to be put onto circuits that obey those rules.
That means that when the path rules become more restrictive, we need
to retire all the circuits, and make sure that currently pending
circuits aren't used for any requests.
If it's any comfort, doing this was even more complicated in C tor. ;)
This patch doesn't actually make anything reconfigurable, but it
does create an API that will tell you "you can't change the value of
that!" If the API looks reasonable, I can start making it possible
to change the values of individual items.
Previously we used Duration::mul_f64, which panics if its output is
out-of-range. That shouldn't actually be possible for the values
we're giving it, but probably it's better to just multiply in a safe
way.
This resolves a couple of XXXXs and therefore relates to #231.
We want to only use TODO in the codebase for non-blockers, and open
tickets for anything that is a bigger blocker than a TODO. These
XXXXs seem like definite non-blockers to me.
Part of arti#231.
Just as `in_same_family` is a member of Relay, so the function for
getting all the real family members of a relay should belong in the
same crate.
This change also removes the `family()` accessor: it gives the _claimed_ family rather
than the _acknlowedged_ family, and is therefore a bit dangerous.
There's still a hole in this logic; I've noted it in the Limitations
section. If we get a microdescriptor for a relay in between creating
and using the guard restriction, it might be omitted from the family
list.
There's not much reason to use a HashSet here, since we're just
going over the whole list.
This reverts commit 16e8489abb and does a little more
refactoring.
The new CircMgr::build_circuits_preemptively function actually causes
preemptive circuits to be built; it gets called from arti-client, like
the other daemon tasks the CircMgr has.