This duplicates some code from hsclient as noted in the comments;
it might be good to reduce this, but the remaining nontrivial
duplication is small, and the logic flow is slightly different
because of the two-step process.
This is a very small change that converts our Vec cheaply into a boxed
slice during program generation. Program generation speed shows no
changes, and there's no change when using compiled hashes, but is a
surprisingly effective 10% speedup to interpreted hash execution.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was looking for ways to optimize out the many redundant capacity
checks in the Assembler. I didn't find any promising approaches, but
I also saw no evidence that it was an important bottleneck. (A simple
unsafe fix didn't improve any important metrics)
While I was in there, I tightened up the buffer size definitions for
both x86_64 and aarch64, and added assertions to test the limits we
set for the size of prologue, epilogue, and single instructions.
I kept some of the inlining and data type tweaks, even though benchmarks
show no difference. They seem like a step in the right direction, from
the disassembly at least.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This is a very simple change that avoids a surprising performance
pitfall: using the code() method on an enum from another crate
caused a non-inlined function call in code where we otherwise expect
a high level of compiler optimization. Replacing code() with a cast
to u8 avoids this function call and allows more intensive optimization
at the call site.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This hoists a few decisions out of the innermost portions of
choose_dst_reg, by moving what we can out of dst_register_allowed.
Wallclock time benchmarks:
generate-interp improves, -6.0%
Cachegrind benchmarks:
generate_interp_1000x, -5.0% instructions, -11.6% L2 access, -6% RAM
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was trying to eliminate all the places where we copied a Program
(about 4100 bytes) except for the one final copy into a Box; but that
approach was proving too annoying. Even returning a Program via Result
will cause multiple unnecessary copies that don't optimize out.
This patch switches approaches, and instead allocates a Vec<Instruction>
presized to the correct capacity. This allocation is made as early as
possible and retained for the lifetime of the program if necessary.
This means we'll never avoid a heap allocation, but we can always
avoid extra copies and we don't need a separate Box for interpreted
programs.
Performance effects are subtle. Overall wallclock time doesn't change
much. Cachegrind shows some accesses moving up from RAM to L2 cache.
Using GDB to probe memcpy sizes shows that large (>1024b) memcpy are now
totally gone in the generate-interp test.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
Closer inspection of the CPU counters showed that the branching in
RegisterSet::index() was a big problem, contributing to the overall
CPU frontend stall bottleneck in program generation.
This new version is less general, and closer to the appraoch used by
the original C implementation. We store a sorted ArrayVec of in-set
registers, and most operations construct the RegisterSet only once
using a combined filter predicate.
Choosing a register from a set is now cheaper in branches, instructions,
and L1 cache space. We now very rarely manipulate an entire RegisterSet
in any way other than by selecting a register randomly. (Just for the
register R5 special case.)
Wallclock time benchmarks:
generate-interp improves, -7.0%
generate-x86_64 improves, -7.2%
Cachegrind benchmarks:
generate_interp_1000x, more total instructions run but a large
decrease in frontend cache misses. +4.6% instructions, +11% L1
accesses, -99% L2 access, -40% RAM access.
generate_compiled_100x, +4.0% instructions, +9.4% L1 access.
cache miss improvements: -57% L2 access, -25% RAM access.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
There was a special case in writer_pair_allowed for making add and
subtract equivalent. This patch changes RegisterWriter's encoding, using
per-opcode variants instead of per-format variants. The Add/Sub merge
can now happen earlier, when RegisterWriter is constructed.
Before and after RegisterWriter sizes are the same, at 8 bytes.
This patch removes many uses of Option<RegisterWriter> in favor
of using a new RegisterWriter::None default, and passes by value
rather than by reference.
Wallclock time benchmarks:
generate-interp improves, -7.5%
generate-x86_64 improves, -5.3%
Cachegrind benchmarks:
generate_interp_1000x, negligible change in total instructions,
improvement in cache footprint: -22.8% L2 accesses
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This uses the 'iai' crate and valgrind to measure fine grained cache
behavior during program generation and hash computation.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
For consistency with the other `ClientCirc` APIs,
`ClientCirc::allow_stream_requests` now takes a `HopNum` argument. Upon
receiving an incoming stream request, the reactor now checks if the
request came from the hop specified in `allow_stream_requests` (and if
it came from a different hop, the circuit is closed).
Part of #1009
The IptEstablisher needs to continuously maintain the IPT even as the
netdir is updated. Whereas, the IPT manager just wants to select the
relay from the netdir once and then only think about the relay
identity.
So it makes sense for the establisher to do necessary lookups of the
relay's ids in the netdir.
This was introduced in c82cda85d6
tor-basic-utils: DropNotifyWatchSender: use DropNotifyEofSignallable
and already, then, the is_eof() method is redundant.
At the very least, I need FatalError to be distinct:
IptEstablisher::new ought not to fail unless everything is terrible.
Add a the Spawn variant to FatalError (that we'll need soon) and the
Bug variant (which it seems likely we might need).
This also gets rid of the crate-level Result alias.
This code has most of what we need to go from an INTRODUCE2 message
we've just received to the point where we've connected to the
rendezvous point and we're waiting for a stream of BEGIN messages.
Unfinished pieces are marked with TODO HSS.
Most of #980.
We want to change the error return types of many methods, so we need a
way to name `std::result::Result`.
We could use `StdResult`, but, actually, properly distinguishing the
kinds of errors that can occur in various contexts means we don't
actually want a single Error type for the whole crate, so
`crate::Result` is going to go away.
This commit deprecates the `EncodeError::always_bug` function with a
`From<EncodeError> for Bug` trait, which is a more semantically correct
way to perform this action.
The IPT manager is going to want to separate the IptEstablisher
struct (which contains the Drop signal) from the watch receiver.
We could add an accessor to clone the watch, but the copy in the
IptEstablisher would be redundant.
This makes new()'s signature a bit funky but it's an internal method
so I think that's fine.