This is a very small change that converts our Vec cheaply into a boxed
slice during program generation. Program generation speed shows no
change, and neither does compiled hash execution, but interpreted hash
execution gets a surprisingly effective 10% speedup.
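A minimal sketch of the conversion, assuming a hypothetical Instruction
type and finish() helper rather than the crate's real names.
Vec::into_boxed_slice discards any spare capacity, so it needs no
reallocation when the Vec was sized exactly:

```rust
/// Hypothetical stand-in for the real instruction type.
#[derive(Debug, Clone, PartialEq)]
struct Instruction(u8);

/// Hypothetical helper: freeze a finished Vec into a boxed slice.
/// A boxed slice is just pointer + length, with no capacity field.
fn finish(instructions: Vec<Instruction>) -> Box<[Instruction]> {
    instructions.into_boxed_slice()
}

fn main() {
    // Presize exactly, so the conversion has no excess capacity to trim.
    let mut v = Vec::with_capacity(4);
    for i in 0..4 {
        v.push(Instruction(i));
    }
    let program = finish(v);
    assert_eq!(program.len(), 4);
    assert_eq!(program[3], Instruction(3));
}
```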
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was looking for ways to optimize out the many redundant capacity
checks in the Assembler. I didn't find any promising approaches, but
I also saw no evidence that it was an important bottleneck. (A simple
unsafe fix didn't improve any important metrics.)
While I was in there, I tightened up the buffer size definitions for
both x86_64 and aarch64, and added assertions to test the limits we
set for the size of prologue, epilogue, and single instructions.
I kept some of the inlining and data type tweaks, even though benchmarks
show no difference. They seem like a step in the right direction, from
the disassembly at least.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This is a very simple change that avoids a surprising performance
pitfall: using the code() method on an enum from another crate
caused a non-inlined function call in code where we otherwise expect
a high level of compiler optimization. Replacing code() with a cast
to u8 avoids this function call and allows more intensive optimization
at the call site.
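A minimal sketch of the two forms, using a hypothetical Reg enum and
encode helpers (not the real types involved). Both give the same value;
the difference is only whether the call site depends on a function that
the compiler may leave out-of-line when it lives in another crate:

```rust
/// Hypothetical register enum; imagine it defined in another crate.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reg {
    R0 = 0,
    R1 = 1,
    R2 = 2,
}

impl Reg {
    /// Accessor in the style of code(). Without #[inline], a call to a
    /// non-generic function across crates may stay an out-of-line call.
    fn code(self) -> u8 {
        self as u8
    }
}

/// Hypothetical call site using the cast: no function call involved.
fn encode_with_cast(reg: Reg) -> u8 {
    0x40 | (reg as u8)
}

/// Same result via the accessor, for comparison.
fn encode_with_code(reg: Reg) -> u8 {
    0x40 | reg.code()
}

fn main() {
    for reg in [Reg::R0, Reg::R1, Reg::R2] {
        assert_eq!(encode_with_cast(reg), encode_with_code(reg));
    }
}
```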
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This hoists a few decisions out of the innermost portions of
choose_dst_reg, by moving what we can out of dst_register_allowed.
Wallclock time benchmarks:
generate-interp improves, -6.0%
Cachegrind benchmarks:
generate_interp_1000x, -5.0% instructions, -11.6% L2 access, -6% RAM
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was trying to eliminate all the places where we copied a Program
(about 4100 bytes) except for the one final copy into a Box; but that
approach was proving too annoying. Even returning a Program via Result
will cause multiple unnecessary copies that don't optimize out.
This patch switches approaches, and instead allocates a Vec<Instruction>
presized to the correct capacity. This allocation is made as early as
possible and retained for the lifetime of the program if necessary.
This means we'll never avoid a heap allocation, but we can always
avoid extra copies and we don't need a separate Box for interpreted
programs.
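A rough sketch of the idea, with a hypothetical Instruction type,
generate() function, and an assumed instruction count. The storage is
allocated once up front and filled in place, so nothing large is ever
returned or moved by value:

```rust
/// Hypothetical stand-in for the real instruction type.
#[derive(Debug, Clone, PartialEq)]
struct Instruction(u32);

/// Assumed program length, for illustration only.
const NUM_INSTRUCTIONS: usize = 512;

/// Hypothetical generator: fills the caller's storage in place instead
/// of returning a large array by value.
fn generate(out: &mut Vec<Instruction>) {
    for i in 0..NUM_INSTRUCTIONS as u32 {
        out.push(Instruction(i));
    }
}

fn main() {
    let mut program: Vec<Instruction> = Vec::with_capacity(NUM_INSTRUCTIONS);
    let before = program.as_ptr();
    generate(&mut program);
    // Staying within the reserved capacity means no reallocation or copy.
    assert_eq!(program.as_ptr(), before);
    assert_eq!(program.len(), NUM_INSTRUCTIONS);
}
```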
Performance effects are subtle. Overall wallclock time doesn't change
much. Cachegrind shows some accesses moving up from RAM to L2 cache.
Using GDB to probe memcpy sizes shows that large (>1024 byte) memcpy
calls are now totally gone in the generate-interp test.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
Closer inspection of the CPU counters showed that the branching in
RegisterSet::index() was a big problem, contributing to the overall
CPU frontend stall bottleneck in program generation.
This new version is less general, and closer to the approach used by
the original C implementation. We store a sorted ArrayVec of in-set
registers, and most operations construct the RegisterSet only once
using a combined filter predicate.
Choosing a register from a set is now cheaper in branches, instructions,
and L1 cache space. We now very rarely manipulate an entire RegisterSet
in any way other than by selecting a register randomly. (Just for the
register R5 special case.)
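A simplified sketch of this shape of RegisterSet. It uses a plain
fixed-size array in place of ArrayVec to stay dependency-free, and the
register count, method names, and predicate are assumptions made for
illustration:

```rust
/// Assumed register count, for illustration only.
const NUM_REGISTERS: usize = 8;

/// Sorted list of in-set registers; a plain array stands in for ArrayVec.
#[derive(Debug, Clone, Default)]
struct RegisterSet {
    regs: [u8; NUM_REGISTERS],
    len: usize,
}

impl RegisterSet {
    /// Build the set once, from a single combined predicate. Scanning in
    /// ascending order keeps the stored registers sorted.
    fn from_filter<P: Fn(u8) -> bool>(pred: P) -> Self {
        let mut set = Self::default();
        for r in 0..NUM_REGISTERS as u8 {
            if pred(r) {
                set.regs[set.len] = r;
                set.len += 1;
            }
        }
        set
    }

    fn len(&self) -> usize {
        self.len
    }

    /// Selecting the i-th member is a direct array read, with no scan.
    fn index(&self, i: usize) -> u8 {
        assert!(i < self.len);
        self.regs[i]
    }
}

fn main() {
    // Combined predicate: odd registers, excluding register 5.
    let set = RegisterSet::from_filter(|r| r % 2 == 1 && r != 5);
    assert_eq!(set.len(), 3);
    let random_value = 10usize; // stand-in for RNG output
    assert_eq!(set.index(random_value % set.len()), 3);
}
```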
Wallclock time benchmarks:
generate-interp improves, -7.0%
generate-x86_64 improves, -7.2%
Cachegrind benchmarks:
generate_interp_1000x, more total instructions run but a large
decrease in frontend cache misses. +4.6% instructions, +11% L1
accesses, -99% L2 access, -40% RAM access.
generate_compiled_100x, +4.0% instructions, +9.4% L1 access.
cache miss improvements: -57% L2 access, -25% RAM access.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
There was a special case in writer_pair_allowed for making add and
subtract equivalent. This patch changes RegisterWriter's encoding, using
per-opcode variants instead of per-format variants. The Add/Sub merge
can now happen earlier, when RegisterWriter is constructed.
Before and after RegisterWriter sizes are the same, at 8 bytes.
This patch removes many uses of Option<RegisterWriter> in favor
of using a new RegisterWriter::None default, and passes by value
rather than by reference.
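A loose sketch of the encoding change. The Opcode subset, the variant
payloads, and the body of writer_pair_allowed here are all hypothetical;
the point is only that merging Add and Sub at construction time turns
the later comparison into plain equality:

```rust
/// Hypothetical subset of opcodes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Opcode {
    Add,
    Sub,
    Xor,
}

/// Hypothetical per-opcode record of what last wrote a register.
#[derive(Debug, Clone, PartialEq, Default)]
enum RegisterWriter {
    /// Nothing has written the register yet; replaces Option::None.
    #[default]
    None,
    /// Add and Sub share one variant, merged at construction time.
    AddSub(u8),
    Xor(u8),
}

impl RegisterWriter {
    fn new(op: Opcode, src: u8) -> Self {
        match op {
            Opcode::Add | Opcode::Sub => RegisterWriter::AddSub(src),
            Opcode::Xor => RegisterWriter::Xor(src),
        }
    }
}

/// Assumed rule for this sketch: reject an exact repeat of the last
/// writer. Arguments are passed by value.
fn writer_pair_allowed(last: RegisterWriter, next: RegisterWriter) -> bool {
    last != next
}

fn main() {
    let add = RegisterWriter::new(Opcode::Add, 3);
    let sub = RegisterWriter::new(Opcode::Sub, 3);
    assert_eq!(add, sub);
    assert!(!writer_pair_allowed(add, sub));
    let xor = RegisterWriter::new(Opcode::Xor, 3);
    assert!(writer_pair_allowed(RegisterWriter::default(), xor));
}
```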
Wallclock time benchmarks:
generate-interp improves, -7.5%
generate-x86_64 improves, -5.3%
Cachegrind benchmarks:
generate_interp_1000x, negligible change in total instructions,
improvement in cache footprint: -22.8% L2 accesses
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This uses the 'iai' crate and valgrind to measure fine grained cache
behavior during program generation and hash computation.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
In response to review feedback, explain that 'seed' here is more
for compatibility and convenience and not central to our goal of
fuzzing the program generator.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
Fuzz testing for HashX. Uses a hook into the pseudorandom number
stream to test the program generator deeply on input that can
be mutated by the fuzzer. Confirms program generation by running
a small number of arbitrary test hashes, so we don't need to
understand the implementation-specific program format to test the
program generator.
We test four implementations in parallel this way: the compiled and
interpreted implementations included in both this crate and c-tor.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
Propagates this setting from the outer Cargo.toml to the new
benchmark crates, since they no longer get the setting by
being included in the main workspace.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
It might be useful to keep these locked down for benchmark
reproducibility. Currently the hashx and equix crates are
fully separate.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This is a small batch of improvements for the equix and hashx
benchmarks. The headline feature is that we are now including
the C implementations (slightly modified from tevador's, hosted
as part of c-tor) and using them in apples-to-apples comparisons.
Minor features:
- Benchmarks moved to new nested crates, preventing their
dependencies from spilling into the main workspace build.
- Tests are now grouped
- We also test the performance of memory reuse where possible
- Code cleanup for per-runtime options
These benchmark builds will now automatically pull in the c-tor
git repo and build portions of it with a Rust wrapper. This uses
the 'cc' and 'bindgen' crates, so it requires a C compiler and
libclang on the host system.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This replaces the 'TODO' marker from earlier commits, using tevador's
copyright and license (LGPL 3.0 only) for the hashx and equix crates.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I originally wrote this in an overcomplicated way, to avoid
frequent initialization of a RegisterWriter array. It turns out
that RegisterWriter can be fairly compact, so this extra level of
indirection isn't necessary or measurably helpful.
This still manages to avoid declaring RegisterWriter as Copy, by
using Default to initialize the array instead of an array constructor.
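A small sketch of the initialization trick, with a hypothetical
two-variant RegisterWriter and an assumed register count. An array
repeat expression like [x; N] requires the element to be Copy (or a
constant item), while Default::default() works for any element type
that implements Default, for arrays of up to 32 elements:

```rust
/// Hypothetical compact writer type; deliberately not Copy.
#[derive(Debug, Clone, PartialEq, Default)]
enum RegisterWriter {
    #[default]
    None,
    Mul(u8),
}

/// Assumed register count, for illustration only.
const NUM_REGISTERS: usize = 8;

/// Hypothetical helper: build the per-register array without needing Copy.
fn fresh_writers() -> [RegisterWriter; NUM_REGISTERS] {
    Default::default()
}

fn main() {
    let mut writers = fresh_writers();
    assert!(writers.iter().all(|w| *w == RegisterWriter::None));
    writers[2] = RegisterWriter::Mul(5);
    assert_eq!(writers[2], RegisterWriter::Mul(5));
}
```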
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
In response to review feedback. The byte output is only needed
for unit tests right now, since Equi-X uses u64 output exclusively.
The optimization for shorter output widths can shave tiny amounts of
time off hash benchmarks, but in this case it's more helpful to avoid
introducing APIs that offer parameters with incomplete compile-time
range checking.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This refactors the random number generator used within HashX's program
generator so that it uses the rand::RngCore trait. The basic SipHash
powered u64 generator now implements RngCore, while a buffer layer
wraps this and provides u8 and u32 values as needed by the generator.
Some of this new RngCore layer is now exposed to the hashx crate's
public API. The intent is to allow external code to test, benchmark, or
fuzz the program generator by supplying its own random number stream.
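A dependency-free sketch of the layering. The real patch builds on
rand::RngCore; here a hypothetical U64Source trait stands in for it so
the example is self-contained. The BufferedRng and Fixed names, and the
high-bits-first order in which values leave the buffer, are assumptions
of this sketch:

```rust
/// Hypothetical minimal source of u64 values, standing in for RngCore.
trait U64Source {
    fn next_u64(&mut self) -> u64;
}

/// Hypothetical buffer layer that dispenses u8 and u32 values from u64s.
struct BufferedRng<S: U64Source> {
    source: S,
    u32_buf: u64,
    u32_left: u32,
    u8_buf: u64,
    u8_left: u32,
}

impl<S: U64Source> BufferedRng<S> {
    fn new(source: S) -> Self {
        Self { source, u32_buf: 0, u32_left: 0, u8_buf: 0, u8_left: 0 }
    }

    /// Next u32; refills from the source once every two calls.
    fn next_u32(&mut self) -> u32 {
        if self.u32_left == 0 {
            self.u32_buf = self.source.next_u64();
            self.u32_left = 2;
        }
        self.u32_left -= 1;
        (self.u32_buf >> (self.u32_left * 32)) as u32
    }

    /// Next u8; refills from the source once every eight calls.
    fn next_u8(&mut self) -> u8 {
        if self.u8_left == 0 {
            self.u8_buf = self.source.next_u64();
            self.u8_left = 8;
        }
        self.u8_left -= 1;
        (self.u8_buf >> (self.u8_left * 8)) as u8
    }
}

/// Caller-supplied stream, as a test, benchmark, or fuzzer might provide.
struct Fixed(u64);

impl U64Source for Fixed {
    fn next_u64(&mut self) -> u64 {
        self.0
    }
}

fn main() {
    let mut rng = BufferedRng::new(Fixed(0x0102_0304_0506_0708));
    assert_eq!(rng.next_u8(), 0x01);
    assert_eq!(rng.next_u8(), 0x02);
    assert_eq!(rng.next_u32(), 0x0102_0304);
    assert_eq!(rng.next_u32(), 0x0506_0708);
}
```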
Benchmarks show a small but confusing performance improvement
associated with this patch: about 2% in program generation, which
could be due to the Rng changes, and no change in compiled hash
execution. Even though this patch only touches program generation,
benchmarks also show a 4% speedup in interpreted execution.
This seems most likely explained by instruction cache effects,
but I'm not sure.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
I was hoping most of the program generator would get inlined, so we can
resolve a lot of the edge cases at compile-time. This patch gets us
close to that, adding many inline attrs and rewriting RegisterSet with
explicit unrolling and storage types that are easier for the optimizer
to reason about.
From the disassembly of the program generator, it's now mostly one big
function with a jump table. From callgrind instruction profiles, there
are no longer obvious hotspots in register set scanning loops. It also
looks like we're often keeping per-register schedule information all
loaded into machine registers now.
Keeping the Rng entry points non-inlined for now seems to be slightly
better, by a percent or two.
There's some work left to do in compiled programs, and maybe room for
improvement in the Program representation too. That will be in a future
patch.
Benchmarks show about 20% improvement on my machine:
  generate-interp  time:   [75.440 µs 75.551 µs 75.684 µs]
                   change: [-24.083% -23.775% -23.483%] (p = 0.00 < 0.05)
                   Performance has improved.
  Found 11 outliers among 100 measurements (11.00%)
    5 (5.00%) high mild
    6 (6.00%) high severe
  generate-x86_64  time:   [96.068 µs 96.273 µs 96.540 µs]
                   change: [-18.699% -18.381% -18.013%] (p = 0.00 < 0.05)
                   Performance has improved.
  Found 10 outliers among 100 measurements (10.00%)
    4 (4.00%) high mild
    6 (6.00%) high severe
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>
This is a new pure Rust implementation of the HashX algorithm
designed by tevador for Tor's onion service proof of work puzzle v1.
HashX is a lightweight family of randomly generated hash functions.
A seed, via blake2 and siphash, drives a program generation model
which randomly selects opcodes and registers while following some
constraints that avoid timing stalls or insufficient hash mixing.
These hash functions can be executed using a pure Rust interpreter,
or about 20x faster using a very simple just-in-time compiler based
on the dynasm assembler crate. The compiler has been implemented for
x86_64 and aarch64.
Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>