Prior to 0.11.0 we had cases where we would treat errors
as warnings: regretfully, this is still needed. This message
in particular has been widely reported, and it now causes
channel force closes.
Downgrade and log. I did insert some snarky log message earlier,
but hey, I'm sure CLN has done worse things to our peers!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: treat LND "internal error" as warnings, not force close events (as we did in v0.10).
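As a rough sketch of the downgrade (names here are hypothetical, not the actual channeld code), the idea is simply to recognise the offending error text and log it rather than fail the channel:
```
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the actual CLN code: decide whether a peer's
 * error should be downgraded to a warning.  LND sends "internal error"
 * for transient conditions, so log it and keep the channel open instead
 * of force-closing. */
static bool downgrade_to_warning(const char *errmsg)
{
	return strstr(errmsg, "internal error") != NULL;
}

int main(void)
{
	const char *errmsg = "internal error";

	if (downgrade_to_warning(errmsg))
		printf("peer sent error '%s': logging as warning only\n", errmsg);
	else
		printf("peer sent error '%s': failing channel\n", errmsg);
	return 0;
}
```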
@whitslack complained of large CPU usage by connectd at startup;
I ran perf record on connectd on my machine (which only sees a small spike)
and saw the cost of reading and discarding the entries:
```
- 95.52% 5.24% lightning_conne lightning_connectd [.] gossip_store_next
- 90.28% gossip_store_next
+ 40.27% tal_alloc_arr_
+ 22.78% tal_free
+ 11.74% crc32c
+ 9.32% fromwire_peektype
+ 4.10% __libc_pread64 (inlined)
1.70% be32_to_cpu
```
Much of this is caused by the search for our own gossip: keeping this separately
would be even better, but this fix is minimal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: connectd: reduce initial CPU load when connecting to peers.
`peer_reconnected` was freeing a `struct peer_reconnected` instance
while a pointer to that instance was registered to be passed as an
argument to the `retry_peer_connected` callback function. This caused a
use-after-free crash when `retry_peer_connected` attempted to reparent
the instance to the temporary context.
Instead, never have `peer_reconnected` free a `struct peer_reconnected`
instance, and only ever allow such an instance to be freed after the
`retry_peer_connected` callback has finished with it. To ensure that the
instance is freed even if the connection is closed before the callback
can be invoked, parent the instance to the connection rather than to the
daemon.
Absent the need to free `struct peer_reconnected` instances outside of
the `retry_peer_connected` callback, there is no use for the
`reconnected` hashtable, so remove it as well.
See: https://github.com/ElementsProject/lightning/issues/5282#issuecomment-1141454255
Fixes: #5282
Fixes: #5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.
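A minimal sketch of the ownership change, using ccan/tal's hierarchical allocation as CLN does; the names and fields here are illustrative, not the exact connectd code:
```
#include <ccan/tal/tal.h>

struct conn;	/* stand-in for connectd's connection object */

struct peer_reconnected {
	struct conn *conn;
	/* ... peer id, queued messages, etc. ... */
};

/* Parent the struct to the connection, not the daemon: if the connection
 * is closed before the retry callback runs, tal frees it automatically,
 * so nothing else ever has to free it early (and nothing dangles). */
static struct peer_reconnected *new_peer_reconnected(struct conn *conn)
{
	struct peer_reconnected *pr = tal(conn, struct peer_reconnected);
	pr->conn = conn;
	return pr;
}

static void retry_peer_connected(struct peer_reconnected *pr)
{
	/* ... resume the deferred reconnection here ... */

	/* Only after the callback is done with it is the struct freed. */
	tal_free(pr);
}
```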
This was fixed in 1c495ca5a8 ("connectd:
fix accidental handling of old reconnections.") and then reverted by
the rework in "connectd: avoid use-after-free upon multiple
reconnections by a peer".
The latter made the race much less likely, since we cleaned up the
reconnecting struct once the connection was hung up by the remote
node, but it's still theoretically possible.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This likely led to a number of false errors when attempting to
route. We deemed a channel unusable as soon as either direction
wasn't usable. This is bad, since it excludes not only zeroconf
channels (which have different scids for the two directions), but
also any channel for which we haven't seen an update yet. This was
likely introduced when attempting to exclude nodes that haven't sent
a disable while their peer has; that is unnecessary, as the
unresponsive node would be marked as isolated by all its peers,
so we don't need to artificially mark a channel direction as disabled
when we can't even enter the node to traverse the channel in
that direction.
Changelog-Fixed: routing: Fixed an issue where we would exclude the entire channel if either direction was disabled, or we hadn't seen an update yet.
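Illustratively (this is not the actual routing code), each channel has two half-channels, and only the half being traversed should decide usability:
```
#include <stdbool.h>

/* Illustrative only: a channel has two independent directions. */
struct half_chan {
	bool seen_update;   /* have we seen a channel_update for this side? */
	bool enabled;       /* ...and did it leave this side enabled? */
};

struct chan {
	struct half_chan half[2];
};

/* Old, overly strict check: the whole channel was excluded if *either*
 * direction was disabled or had no update yet. */
static bool chan_usable_old(const struct chan *c)
{
	for (int dir = 0; dir < 2; dir++)
		if (!c->half[dir].seen_update || !c->half[dir].enabled)
			return false;
	return true;
}

/* Fixed check: only the direction we would actually traverse matters. */
static bool chan_usable(const struct chan *c, int dir)
{
	return c->half[dir].seen_update && c->half[dir].enabled;
}
```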
When we moved gossip filtering to connectd, this aging got lost.
Without this, we hit the 10,000-entry limit before expiring the full
gossip anti-echo cache. This is under 1M in allocations per peer, but
in DEVELOPER mode each allocation adds 3 notifiers (32 bytes
each) and a backtrace child (40 + 40 + 256 bytes), making it almost
10MB per peer, plus allocation overhead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: connectd: large memory usage with many peers fixed.
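Rough arithmetic behind those figures; the allocations-per-entry count here is an assumption for illustration, not something stated above:
```
#include <stdio.h>

int main(void)
{
	const long entries = 10000;             /* anti-echo cache limit */
	const long notifiers = 3 * 32;          /* 3 notifiers, 32 bytes each */
	const long backtrace = 40 + 40 + 256;   /* backtrace child */
	const long extra_per_alloc = notifiers + backtrace;	/* 432 bytes */
	const long allocs_per_entry = 2;        /* assumed, for illustration */

	printf("DEVELOPER overhead per allocation: %ld bytes\n", extra_per_alloc);
	printf("per peer at the cache limit: ~%.1f MB\n",
	       entries * allocs_per_entry * extra_per_alloc / 1e6);
	return 0;
}
```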
I have a test which reproduces this, too, and it's been seen in the
wild. It seems we can add a subd as we're closing, which causes
this assert to trigger.
Fixes: #5254
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.
Turns out, it was actually old reestablish messages! When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.
However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close. Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).
Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.
(I have a test which triggers this, but it also hangs the connect
command, due to other issues we will fix in the next release...)
Fixes: #5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When building reproducible build for Bionic:
```
Traceback (most recent call last):
File "/usr/local/bin/mrkd", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/mrkd/__init__.py", line 261, in main
result = mistune.markdown(fp.read(), inline=inline, renderer=renderer)
File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1856: ordinal not in range(128)
doc/Makefile:120: recipe for target 'doc/lightning-getinfo.7' failed
make: *** [doc/lightning-getinfo.7] Error 1
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We call out to connectd to activate the peer, and while we do that,
channel->owner is NULL. A better pattern would be to set up the unsaved
channel once connectd has given us the peer, but this works for now.
Fixes: #5204
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This seems to prevent broad propagation, due to LND not allowing it. See
https://github.com/lightningnetwork/lnd/issues/6432
We still announce it if you disable deprecated-apis, so tests still work,
and hopefully we can enable it in future.
Fixes: #5196
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: Protocol: disabled websocket announcement due to LND propagation issues
We have an explicit filter against redundant node_announcement
updates; we only allow 1 a week. This means that our change to force
a reannouncement every 24 hours did not work!
Allow once a day, instead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
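Sketched out (not the exact gossipd code), the redundancy filter now keeps an otherwise-unchanged node_announcement once its timestamp has advanced by a day rather than a week:
```
#include <stdbool.h>
#include <stdint.h>

#define SECONDS_PER_DAY 86400

/* Illustrative only: accept a content-identical node_announcement if its
 * timestamp moved forward by at least a day (previously: a week). */
static bool keep_redundant_nannounce(uint32_t prev_timestamp,
				     uint32_t new_timestamp)
{
	return new_timestamp >= prev_timestamp + SECONDS_PER_DAY;
}
```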
We seem to have made node_announcement propagation *worse*, not
better. Explorers don't see my node's updates.
At least some LND nodes never send us timestamp_filter, so we never
actually stream *any* gossip to them. We should send gossip about
ourselves, even if they haven't set a filter (yet).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we more aggressively send our own gossip, to improve propagation chances.
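A sketch of the intended behaviour (illustrative names, not the real connectd gossip-streaming code): records we originated are pushed regardless of whether the peer ever sent a filter:
```
#include <stdbool.h>
#include <stdint.h>

struct gossip_rec {
	bool ours;              /* did we originate this record? */
	uint32_t timestamp;
};

struct peer_filter {
	bool set;               /* has the peer sent gossip_timestamp_filter? */
	uint32_t first, last;   /* range requested, if set */
};

/* Stream our own gossip even to peers (e.g. some LND nodes) that never
 * set a timestamp filter; everything else still honours the filter. */
static bool should_stream(const struct gossip_rec *rec,
			  const struct peer_filter *f)
{
	if (rec->ours)
		return true;
	if (!f->set)
		return false;
	return rec->timestamp >= f->first && rec->timestamp <= f->last;
}
```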
This attempted to make us re-xmit our own node_announcement at restart,
by moving the node_announcement to the end of the gossip store. But,
as nothing was connected yet, this had no effect!
We will rexmit it anyway, since it's marked PUSH.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I have no idea why someone else suddenly owns the directory, but all git
commands fail. Work around it as suggested by the error message.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I was seeing a strange crash:
```
Connectd gave bad CONNECT_PEER_CONNECTED message
```
The message is indeed mangled, around the remote_addr!
A quick review of the code revealed that we were not making a copy
when it was a reconnect, and so the remote_addr pointer was pointing
to memory which was freed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
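A minimal sketch of the shape of the fix, using ccan/tal; the struct definition here is a stand-in for illustration, not the real one from common/wireaddr.h:
```
#include <ccan/tal/tal.h>

/* Stand-in definition for illustration only. */
struct wireaddr {
	int type;
	unsigned char addr[16];
	unsigned short port;
};

struct peer_connected_info {
	struct wireaddr *remote_addr;
	/* ... */
};

/* On reconnect, duplicate remote_addr onto our own context instead of
 * keeping a pointer into memory owned by the old connection, which is
 * about to be freed. */
static void remember_remote_addr(const tal_t *ctx,
				 struct peer_connected_info *info,
				 const struct wireaddr *remote_addr)
{
	info->remote_addr = tal_dup(ctx, struct wireaddr, remote_addr);
}
```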
We now have ternary outcomes for `Builder.configure()` and
`Builder.start()`:
- `Ok(Some(p))` means we were configured correctly, and we can continue
with our work normally
- `Ok(None)` means that `lightningd` was invoked with `--help`: we
weren't configured, which is not an error since `lightningd` just
implicitly told us to shut down, and user code should clean up and
exit as well
- `Err(e)` means something went wrong; user code may report the error and exit.
The relative path makes for a difficult experience when people are reading on `https://lightning.readthedocs.io/`. Directly linking saves the reader a few clicks hunting down the correct location :)