We need a type that holds a rend_handshake::IntroRequest object
internally, but where we don't materialize that object from the
Introduce2 message inside the MsgHandler, since that's more crypto
than we want to put in that task.
The hsdir selection algorithm for uploads and downloads is different
enough to justify splitting `hs_dirs` into 2 different functions.
More specifically, when selecting the relays to upload a service's
descriptors to, the service's `hsids` need to be matched up with the
correct `ring` (using the time period) before applying `select_nodes` to
pick the replicas. This is not the case when downloading, because
for downloads we select relays from the current ring.
This duplicates some code from hsclient as noted in the comments;
it might be good to reduce this, but the remaining nontrivial
duplication is small, and the logic flow is slightly different
because of the two-step process.
This is a very small change that converts our Vec cheaply into a boxed
slice during program generation. Program generation speed shows no
change, and there's no change when using compiled hashes, but it is a
surprisingly effective 10% speedup to interpreted hash execution.
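
A minimal sketch of the conversion, for illustration only. The
`Instruction` and `Program` types and the `from_vec` name here are
hypothetical stand-ins, not the real definitions:

```rust
/// Hypothetical stand-in for the real instruction type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Instruction {
    pub opcode: u8,
    pub dst: u8,
    pub src: u8,
}

/// Hypothetical program type that keeps its instructions in a boxed slice.
pub struct Program {
    instructions: Box<[Instruction]>,
}

impl Program {
    /// Finish generation by converting the Vec into a boxed slice.
    pub fn from_vec(instructions: Vec<Instruction>) -> Self {
        // into_boxed_slice() discards any excess capacity first; a Vec that
        // is already exactly full has nothing to discard.
        Self {
            instructions: instructions.into_boxed_slice(),
        }
    }

    /// Borrow the instructions for the interpreter loop.
    pub fn instructions(&self) -> &[Instruction] {
        &self.instructions
    }
}
```

The boxed slice carries only a pointer and a length, so the interpreter
no longer sees a separate capacity field.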
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was looking for ways to optimize out the many redundant capacity
checks in the Assembler. I didn't find any promising approaches, but
I also saw no evidence that it was an important bottleneck. (A simple
unsafe fix didn't improve any important metrics.)
While I was in there, I tightened up the buffer size definitions for
both x86_64 and aarch64, and added assertions to test the limits we
set for the size of prologue, epilogue, and single instructions.
I kept some of the inlining and data type tweaks, even though benchmarks
show no difference. They seem like a step in the right direction, from
the disassembly at least.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This is a very simple change that avoids a surprising performance
pitfall: using the code() method on an enum from another crate
caused a non-inlined function call in code where we otherwise expect
a high level of compiler optimization. Replacing code() with a cast
to u8 avoids this function call and allows more intensive optimization
at the call site.
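
A minimal sketch of the two styles, for illustration only. The `Opcode`
enum and `dispatch_index` function are hypothetical; in the real code the
enum comes from another crate:

```rust
/// Hypothetical opcode enum with an explicit u8 representation.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Opcode {
    Add = 0,
    Sub = 1,
    Mul = 2,
}

impl Opcode {
    /// Accessor in the style of the code() method described above.
    /// Without #[inline], a call from another crate may stay a real
    /// function call.
    pub fn code(self) -> u8 {
        self as u8
    }
}

/// Compute a table index with a plain cast, so there is no call at all
/// for the optimizer to see through.
pub fn dispatch_index(op: Opcode) -> usize {
    (op as u8) as usize
}
```

Both forms return the same value; the difference is only in what the
compiler can see at the call site.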
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This hoists a few decisions out of the innermost portions of
choose_dst_reg, by moving what we can out of dst_register_allowed.
Wallclock time benchmarks:
generate-interp improves, -6.0%
Cachegrind benchmarks:
generate_interp_1000x, -5.0% instructions, -11.6% L2 access, -6% RAM
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was trying to eliminate all the places where we copied a Program
(about 4100 bytes) except for the one final copy into a Box; but that
approach was proving too annoying. Even returning a Program via Result
will cause multiple unnecessary copies that don't optimize out.
This patch switches approaches, and instead allocates a Vec<Instruction>
presized to the correct capacity. This allocation is made as early as
possible and retained for the lifetime of the program if necessary.
This means we'll never avoid a heap allocation, but we can always
avoid extra copies and we don't need a separate Box for interpreted
programs.
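
A minimal sketch of the presized allocation, for illustration only. The
`MAX_INSTRUCTIONS` value, the `Instruction` type, and the `generate`
signature are assumptions, not the real definitions:

```rust
/// Assumed upper bound on program length (hypothetical value).
pub const MAX_INSTRUCTIONS: usize = 512;

/// Hypothetical stand-in for the real instruction type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Instruction {
    pub opcode: u8,
    pub dst: u8,
    pub src: u8,
}

/// Generate a program into a Vec that is allocated once, up front.
/// `next` is a hypothetical source of instructions; None ends generation.
pub fn generate(mut next: impl FnMut() -> Option<Instruction>) -> Vec<Instruction> {
    // One heap allocation, sized for the largest possible program, so
    // pushes below never reallocate.
    let mut program = Vec::with_capacity(MAX_INSTRUCTIONS);
    while program.len() < MAX_INSTRUCTIONS {
        match next() {
            Some(inst) => program.push(inst),
            None => break,
        }
    }
    // Returning the Vec moves only its pointer, length, and capacity;
    // the instruction data stays where it was written.
    program
}
```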
Performance effects are subtle. Overall wallclock time doesn't change
much. Cachegrind shows some accesses moving up from RAM to L2 cache.
Using GDB to probe memcpy sizes shows that large (>1024 byte) memcpy
calls are now totally gone in the generate-interp test.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
Closer inspection of the CPU counters showed that the branching in
RegisterSet::index() was a big problem, contributing to the overall
CPU frontend stall bottleneck in program generation.
This new version is less general, and closer to the approach used by
the original C implementation. We store a sorted ArrayVec of in-set
registers, and most operations construct the RegisterSet only once
using a combined filter predicate.
Choosing a register from a set is now cheaper in branches, instructions,
and L1 cache space. We now very rarely manipulate an entire RegisterSet
in any way other than by selecting a register randomly. (Just for the
register R5 special case.)
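
A minimal sketch of the data structure, for illustration only. It uses a
fixed array plus a length in place of ArrayVec to stay within std, and
`NUM_REGISTERS`, `RegisterId`, and `from_filter` are hypothetical names:

```rust
/// Assumed number of registers (hypothetical value).
pub const NUM_REGISTERS: usize = 8;

/// Hypothetical register identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegisterId(pub u8);

/// Sorted list of the registers that are in the set.
pub struct RegisterSet {
    regs: [RegisterId; NUM_REGISTERS],
    len: usize,
}

impl RegisterSet {
    /// Build the set once, from a single combined predicate.
    /// Registers are visited in ascending order, so `regs` stays sorted.
    pub fn from_filter(mut pred: impl FnMut(RegisterId) -> bool) -> Self {
        let mut regs = [RegisterId(0); NUM_REGISTERS];
        let mut len = 0;
        for i in 0..NUM_REGISTERS as u8 {
            let r = RegisterId(i);
            if pred(r) {
                regs[len] = r;
                len += 1;
            }
        }
        Self { regs, len }
    }

    /// Number of registers in the set.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Return the nth register in the set: a bounds check and a load,
    /// with no per-register branching.
    pub fn index(&self, n: usize) -> Option<RegisterId> {
        self.regs[..self.len].get(n).copied()
    }
}
```

A caller would pick a random `n` below `len()` and pass it to `index()`.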
Wallclock time benchmarks:
generate-interp improves, -7.0%
generate-x86_64 improves, -7.2%
Cachegrind benchmarks:
generate_interp_1000x, more total instructions run but a large
decrease in frontend cache misses. +4.6% instructions, +11% L1
accesses, -99% L2 access, -40% RAM access.
generate_compiled_100x, +4.0% instructions, +9.4% L1 access.
cache miss improvements: -57% L2 access, -25% RAM access.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
There was a special case in writer_pair_allowed for making add and
subtract equivalent. This patch changes RegisterWriter's encoding, using
per-opcode variants instead of per-format variants. The Add/Sub merge
can now happen earlier, when RegisterWriter is constructed.
Before and after RegisterWriter sizes are the same, at 8 bytes.
This patch removes many uses of Option<RegisterWriter> in favor
of using a new RegisterWriter::None default, and passes by value
rather than by reference.
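
A minimal sketch of the encoding change, for illustration only. The
variant set, the `Op` type, `from_op`, and the rule inside
`writer_pair_allowed` are simplified assumptions, not the real logic:

```rust
/// Hypothetical per-opcode writer record, with None as the default
/// instead of wrapping the type in Option.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RegisterWriter {
    #[default]
    None,
    Mul(u8),
    /// Add and Sub share one variant, so they compare equal.
    AddSub(u8),
    Xor(u8),
}

/// Hypothetical instruction kinds, each carrying a source register.
#[derive(Debug, Clone, Copy)]
pub enum Op {
    Mul { src: u8 },
    Add { src: u8 },
    Sub { src: u8 },
    Xor { src: u8 },
}

impl RegisterWriter {
    /// Merge Add and Sub at construction time, not at comparison time.
    pub fn from_op(op: Op) -> Self {
        match op {
            Op::Mul { src } => RegisterWriter::Mul(src),
            Op::Add { src } | Op::Sub { src } => RegisterWriter::AddSub(src),
            Op::Xor { src } => RegisterWriter::Xor(src),
        }
    }
}

/// Simplified stand-in rule: reject a writer identical to the last one.
/// Takes both arguments by value; no Add/Sub special case is needed here.
pub fn writer_pair_allowed(last: RegisterWriter, next: RegisterWriter) -> bool {
    last == RegisterWriter::None || last != next
}
```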
Wallclock time benchmarks:
generate-interp improves, -7.5%
generate-x86_64 improves, -5.3%
Cachegrind benchmarks:
generate_interp_1000x, negligible change in total instructions,
improvement in cache footprint: -22.8% L2 accesses
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This uses the 'iai' crate and valgrind to measure fine grained cache
behavior during program generation and hash computation.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>