vincenzopalazzo/arti - arti - Gitea: Git with a cup of tea

Commit Graph

Author	SHA1	Message	Date
Ian Jackson	c593b861a7	hashx_cachegrind: Fix black_box location	2023-08-23 10:36:33 +01:00
Ian Jackson	f123200d0c	hashx_cachegrind: Make mk_c_equix not shuffle the HashX	2023-08-23 10:36:33 +01:00
Ian Jackson	22d2ed9c98	hashx_cachegrind: Make mk_rust not shuffle the HashXBuilder	2023-08-23 10:36:33 +01:00
Ian Jackson	1f31ffd563	hashx_cachegrind: Introduce C_HASHX_OK alias	2023-08-23 10:36:33 +01:00
Ian Jackson	ad59bfd28d	hashx_cachegrind: Introduce bench_loop helper macro The macro generates similar but not identical code. There are new bindings.	2023-08-23 10:36:33 +01:00
Ian Jackson	71124ef351	hashx_cachegrind: Introduce u32be helper function This is going to be more obviously useful in a moment.	2023-08-23 10:36:33 +01:00
Ian Jackson	7e1d0b7f45	hashx_cachegrind: Introduce mk_c_equix helper macro The macro generates precisely the existing code.	2023-08-23 10:36:33 +01:00
Ian Jackson	e7765ce969	hashx_cachegrind: Introduce mk_rust helper macro The macro generates precisely the existing code.	2023-08-23 10:36:33 +01:00
Micah Elizabeth Scott	26b5ae9a3c	hashx: Use a boxed slice for Program storage This is a very small change that converts our Vec cheaply into a boxed slice during program generation. Program generation speed shows no changes, and there's no change when using compiled hashes, but is a surprisingly effective 10% speedup to interpreted hash execution. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	52dbdbea3b	hashx: Assembly buffer sizing and tidying I was looking for ways to optimize out the many redundant capacity checks in the Assembler. I didn't find any promising approaches, but I also saw no evidence that it was an important bottleneck. (A simple unsafe fix didn't improve any important metrics) While I was in there, I tightened up the buffer size definitions for both x86_64 and aarch64, and added assertions to test the limits we set for the size of prologue, epilogue, and single instructions. I kept some of the inlining and data type tweaks, even though benchmarks show no difference. They seem like a step in the right direction, from the disassembly at least. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	acf598e785	hashx: avoid surprising overhead of enum code() method This is a very simple change that avoids a surprising performance pitfall: using the code() method on an enum from another crate caused a non-inlined function call in code where we otherwise expect a high level of compiler optimization. Replacing code() with a cast to u8 avoids this function call and allows more intensive optimization at the call site. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	0af908bcf2	hashx: Rearrange destination register validator for performance This hoists a few decisions out of the innermost portions of choose_dst_reg, by moving what we can out of dst_register_allowed. Wallclock time benchmarks: generate-interp improves, -6.0% Cachegrind benchmarks: generate_interp_1000x, -5.0% instructions, -11.6% L2 access, -6% RAM Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	ceacd5c988	hashx: New approach to avoid memcpy in Program I was trying to eliminate all the places where we copied a Program (about 4100 bytes) except for the one final copy into a Box; but that approach was proving too annoying. Even returning a Program via Result will cause multiple unnecessary copies that don't optimize out. This patch switches approaches, and instead allocates a Vec<Instruction> presized to the correct capacity. This allocation is made as early as possible and retained for the lifetime of the program if necessary. This means we'll never avoid a heap allocation, but we can always avoid extra copies and we don't need a separate Box for interpreted programs. Performance effects are subtle. Overall wallclock time doesn't change much. Cachegrind shows some accesses moving up from RAM to L2 cache. Using GDB to probe memcpy sizes shows that large (>1024b) memcpy are now totally gone in the generate-interp test. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	ee6acfa5cd	hashx: Rewrite RegisterSet again to reduce CPU frontend stalls Closer inspection of the CPU counters showed that the branching in RegisterSet::index() was a big problem, contributing to the overall CPU frontend stall bottleneck in program generation. This new version is less general, and closer to the appraoch used by the original C implementation. We store a sorted ArrayVec of in-set registers, and most operations construct the RegisterSet only once using a combined filter predicate. Choosing a register from a set is now cheaper in branches, instructions, and L1 cache space. We now very rarely manipulate an entire RegisterSet in any way other than by selecting a register randomly. (Just for the register R5 special case.) Wallclock time benchmarks: generate-interp improves, -7.0% generate-x86_64 improves, -7.2% Cachegrind benchmarks: generate_interp_1000x, more total instructions run but a large decrease in frontend cache misses. +4.6% instructions, +11% L1 accesses, -99% L2 access, -40% RAM access. generate_compiled_100x, +4.0% instructions, +9.4% L1 access. cache miss improvements: -57% L2 access, -25% RAM access. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	e142fd9882	hashx: new RegisterWriter format handles more cases transparently There was a special case in writer_pair_allowed for making add and subtract equivalent. This patch changes RegisterWriter's encoding, using per-opcode variants instead of per-format variants. The Add/Sub merge can now happen earlier, when RegisterWriter is constructed. Before and after RegisterWriter sizes are the same, at 8 bytes. This patch removes many uses of Option<RegisterWriter> in favor of using a new RegisterWriter::None default, and passes by value rather than by reference. Wallclock time benchmarks: generate-interp improves, -7.5% generate-x86_64 improves, -5.3% Cachegrind benchmarks: generate_interp_1000x, negligible change in total instructions, improvement in cache footprint: -22.8% L2 accesses Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-21 15:27:28 -07:00
Micah Elizabeth Scott	780e10e1d5	hashx/bench: Add cachegrind microbenchmarks This uses the 'iai' crate and valgrind to measure fine grained cache behavior during program generation and hash computation. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-18 20:09:40 -07:00
Nick Mathewson	cec6d0ce33	Run add_warnings on all files.	2023-08-04 07:45:04 -04:00
Nick Mathewson	5c607e8cf6	Merge branch 'ticket889_fuzz' into 'main' Fuzzers for Equi-X and HashX See merge request tpo/core/arti!1459	2023-08-02 22:12:20 +00:00
Micah Elizabeth Scott	bf0119fbfe	hashx/fuzz, equix/fuzz: use arti-corpora Remove corpus from .gitignore and add a symlink to the corpora submodule.	2023-08-01 19:34:04 -07:00
Micah Elizabeth Scott	d7c2e7996e	hashx/fuzz: update tor-c-equix dependency my cargo_hashx_rng branch was just merged into main (thanks dgoulet!)	2023-08-01 19:34:04 -07:00
Micah Elizabeth Scott	810adcc50d	hashx/fuzz: Comments, explain our 'seed' input In response to review feedback, explain that 'seed' here is more for compatibility and convenience and not central to our goal of fuzzing the program generator. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:34:04 -07:00
Micah Elizabeth Scott	5a85749e48	hashx/fuzz: Simplify, remove rayon dependency Review feedback is that we don't want parallelism here. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:34:03 -07:00
Micah Elizabeth Scott	ea2716595e	hashx/fuzz: Start a cross-implementation fuzzer for HashX Fuzz testing for HashX. Uses a hook into the pseudorandom number stream to test the program generator deeply on input that can be mutated by the fuzzer. Confirms program generation by running a small number of arbitrary test hashes, so we don't need to understand the implementation-specific program format to test the program generator. We test four implementations in parallel this way, the compiled and interpreted implementations included in both this crate and c-tor. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:34:03 -07:00
Micah Elizabeth Scott	550d89fb57	hashx/bench: Shared generate wrapper for u64-hash and full-hash Code cleanup from review feedback Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:28:46 -07:00
Micah Elizabeth Scott	315122f159	hashx/bench, equix/bench: Enable debug symbols Propagates this setting from the outer Cargo.toml to the new benchmark crates, since they no longer get the setting by being included in the main workspace. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:28:46 -07:00
Micah Elizabeth Scott	256e5de9e5	hashx/bench, equix/bench: check in matching Cargo.lock files It might be useful to keep these locked down for benchmark reproducibility. Currently the hashx and equix crates are fully separate. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:28:46 -07:00
Micah Elizabeth Scott	ae58ea0697	equix, hashx: Benchmark against C implementation This is a small batch of improvements for the equix and hashx benchmarks. The headline feature is that we are now including the C implementations (slightly modified from tevador's, hosted as part of c-tor) and using them in apples-to-apples comparisons. Minor features: - Benchmarks moved to new nested crates, preventing their dependencies from spilling into the main workspace build. - Tests are now grouped - We also test the performance of memory reuse where possible - Code cleanup for per-runtime options These benchmark builds will now automatically pull in the c-tor git repo and build portions of it with a Rust wrapper. This uses the 'cc' and 'bindgen' crates, so it requires a C compiler and libclang on the host system. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-08-01 19:28:43 -07:00
Micah Elizabeth Scott	9257949b80	equix, hashx: Additional comment tweaks More review feedback. Thanks nickm! Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 13:17:23 -07:00
Micah Elizabeth Scott	fdba82100f	equix, hashx: Prepare for an initial LGPL release This replaces the 'TODO' marker from earlier commits, using tevador's copyright and license (LGPL 3.0 only) for the hashx and equix crates. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	4affddaa08	tor-hspow, equix, hashx: Comment tweaks Making a few comment tweaks suggested in review feedback. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	7579febb46	hashx: Simplify RegisterWriterMap I originally wrote this in an overcomplicated way, to avoid frequent initialization of a RegisterWriter array. It turns out that RegisterWriter can be fairly compact, so this extra level of indirection isn't necessary or measurably helpful. This still manages to avoid declaring RegisterWriter as Copy, by using Default to initialize the array instead of an array constructor. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	f4939a5fd8	tor-hspow, equix, hashx: Make all error types Clone This uses an Arc to hold std::io::Error for low-level HashX runtime errors. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	10b7352c98	hashx: Simplify hash_to_bytes, only support fixed output width In response to review feedback. The byte output is only needed for unit tests right now, since Equi-X uses u64 output exclusively. The optimization for shorter output widths can shave tiny amounts of time off hash benchmarks, but in this case it's more helpful to avoid introducing APIs that offer parameters with incomplete compile-time range checking. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	d17c12b152	hashx: use RngCore for HashX's internal PRNG This refactors the random number generator used within HashX's program generator so that it uses the rand::RngCore trait. The basic SipHash powered u64 generator now implements RngCore, while a buffer layer wraps this and provides u8 and u32 values as needed by the generator. Some of this new RngCore layer is now exposed to the hashx crate's public API. The intent is to allow external code to test, benchmark, or fuzz the program generator by supplying its own random number stream. Benchmarks show a small but confusing performance improvement associated with this patch. About a 2% improvement in generation. This could be due to the Rng changes. No change in compiled hash execution performance. Even though this patch only touches program generation, benchmarks show a 4% speedup in interpreted execution. This seems most likely explained by instruction cache effects, but I'm not sure. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	2c20b46921	hashx: Implement Default for RuntimeOption	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	8a79021f38	Update equix, hashx, tor-hspow for new clippy defaults Just running maint/add_warning after the rebase	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	fdfe3ce55f	hashx: register set optimizations, 20% faster generator I was hoping most of the program generator would get inlined, so we can resolve a lot of the edge cases at compile-time. This patch gets us close to that, adding many inline attrs and rewriting RegisterSet with explicit unrolling and storage types that are easier for the optimizer to reason about. From the disassembly of the program generator, it's now mostly one big function with a jump table. From callgrind instruction profiles, there are no longer obvious hotspots in register set scanning loops. It also looks like we're often keeping per-register schedule information all loaded into machine registers now. Keeping the Rng entry points non-inlined for now seems to be slightly better, by a percent or two. There's some work left to do in compiled programs, and maybe room for improvement in the Program representation too. That will be in a future patch. Benchmark shows about 20% improvement on my machine, generate-interp time: [75.440 µs 75.551 µs 75.684 µs] change: [-24.083% -23.775% -23.483%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 5 (5.00%) high mild 6 (6.00%) high severe generate-x86_64 time: [96.068 µs 96.273 µs 96.540 µs] change: [-18.699% -18.381% -18.013%] (p = 0.00 < 0.05) Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:14 -07:00
Micah Elizabeth Scott	a8756f2bce	Reimplement HashX in Rust This is a new pure Rust implementation of the HashX algorithm designed by tevador for Tor's onion service proof of work puzzle v1. HashX is a lightweight family of randomly generated hash functions. A seed, via blake2 and siphash, drives a program generation model which randomly selects opcodes and registers while following some constraints that avoid timing stalls or insufficient hash mixing. The execution of these hash funcions can be done using a pure Rust interpreter, or about 20x faster using a very simple just in time compiler based on the dynasm assembler crate. This has been implemented for x86_64 and aarch64. Signed-off-by: Micah Elizabeth Scott <beth@torproject.org>	2023-07-27 07:20:06 -07:00

38 Commits