shithub: neoventi

ref: b3f5bf725a5e93719b34c552a15ae9103548a4c2
parent: a2ffdf6dfea9b0128732ca068a84654328b8a038
author: Noam Preil <noam@pixelhero.dev>
date: Tue Aug 13 11:45:33 EDT 2024

update notebook

--- a/notebook
+++ b/notebook
@@ -1708,3 +1708,416 @@
 on the other hand, evictions should be relatively rare, since I plan on S3-FIFOing.
 
 Also yeah that totally fixed it :D
+
+...also I can TOTALLY make walk of /n/vac faster, I just took the wrong approach! I need to rewrite vacfs with a new libventi next ;P
+
+With everything cached on the neoventi side, I'm seeing 26 seconds for a full read??? Coulda sworn it was like five, but that was a different image, I think, and this is the Reform instead of the ThinkPad - though the two are generally comparable performance-wise, so I'm uncertain. It'll need further testing.
+
+regardless, seeing six seconds of CPU usage there, which can likely be cut down - and the network roundtrips can probably be eliminated by doing some explicit in-fs prefetching, to cut it down even further.
+
+...worth comparing against venti again :P
+
+neoventi: 69s cold, 26s warm.
+venti:    37.17s cold??? it's winning?! 29s warm. okay.
+
+_sob_
+
+oh wait this is off of /root, so it's not fully cold. right.
+
+That's _concerningly_ close. What happened, neoventi?? Is it because of the Reform??
+
+...oh my gosh, this is with profiling enabled :O
+
+...identical numbers without, though. the hell?
+
+This might just be hardware, since it's only using one core? If there are latency bubbles, the effect might be amplified way beyond...
+
+Of the 69s for a cold read, I'm only seeing 1.06s of CPU time, of which >80% is decompression.
+
+CPU usage on a warm run is basically identical; neoventi only needs one second of CPU time for this. That's not terrible at all.
+
+doesn't explain why a full core is at 100% during this process, though. vacfs isn't using that much either...
+
+probably.
+
+ % tprof 49587
+total: 6420
+
+six seconds. So, between neoventi and vacfs, that's only seven seconds of CPU time?? ...walk itself, maybe??
+
+not going to try writing 'profile' to its ctl quickly enough by hand (though it'd be easy with a while(! ps | awk...) loop), just going to link walk with -p...
+
+might need to tprof anyway; I'm getting a prof error??
+
+% time walk /n/vac > /dev/null
+1 Prof errors
+12.46u 3.62s 27.04r 	 walk /n/vac
+
+a prof... file was generated, but is invalid apparently??
+
+file system isn't even close to full, so that's not it. ah well.
+
+WALKLOOP:
+
+rm /sys/log/walk.prof
+# wait for a walk process to appear, then enable profiling via its ctl file
+while(! ps | grep -s walk)
+	sleep 0.01
+pid=`{ps | grep walk | awk '{print($2)}'}
+echo profile > /proc/$pid/ctl
+# poll tprof into the log until walk exits
+while(tprof $pid >> /sys/log/walk.prof)
+	sleep 0.5
+
+...um.
+
+total: 12760
+TEXT 00010000
+    ms      %   sym
+ 11460	 89.8	seen
+   220	  1.7	_profin
+   200	  1.5	_profout
+   110	  0.8	_callpc
+   110	  0.8	strchr
+    80	  0.6	convM2D
+    70	  0.5	treesplay
+    60	  0.4	memcpy
+    40	  0.3	_restore
+    30	  0.2	pooldel
+    30	  0.2	blocksetdsize
+    30	  0.2	blockcheck
+    20	  0.1	D2B
+    20	  0.1	treelookupgt
+    20	  0.1	Bputc
+    20	  0.1	poolfreel
+    20	  0.1	dirpackage
+    20	  0.1	unlock
+    20	  0.1	walk
+    20	  0.1	statcheck
+    20	  0.1	close
+    10	  0.0	pooladd
+    10	  0.0	dofile
+    10	  0.0	memset
+    10	  0.0	canlock
+    10	  0.0	trim
+    10	  0.0	punlock
+    10	  0.0	open
+    10	  0.0	create
+    10	  0.0	dirread
+    10	  0.0	plock
+    10	  0.0	fullrune
+    10	  0.0	poolallocl
+    10	  0.0	memccpy
+    10	  0.0	pread
+
+walk is ~13s of CPU time?!
+
+oh right wait, that's with -p, so the prof hooks (_profin/_profout) are in there too lol
+
+ummmm
+
+total: 12130
+TEXT 00010000
+    ms      %   sym
+ 11530	 95.0	seen
+   120	  0.9	memmove
+    70	  0.5	convM2D
+    50	  0.4	blockcheck
+    40	  0.3	dirpackage
+    40	  0.3	walk
+    40	  0.3	treesplay
+    20	  0.1	getdsize
+    20	  0.1	poolfreel
+    20	  0.1	blocksetdsize
+    20	  0.1	pooldel
+    20	  0.1	treelookupgt
+    20	  0.1	strchr
+    20	  0.1	s_append
+    10	  0.0	poolalloc
+    10	  0.0	s_terminate
+    10	  0.0	Bwrite
+    10	  0.0	s_free
+    10	  0.0	dirread
+    10	  0.0	dsize2bsize
+    10	  0.0	poolrealloc
+    10	  0.0	pread
+    10	  0.0	_tas
+    10	  0.0	blockmerge
+
+okay so the profiler is shockingly efficient, tbh. expected way worse
+
+but also walk is 12s of CPU time?!
+
+What if I rip out that code and have it walk directories even if it's seen them already? Dangerous for recursive hierarchies, but probably fine for this...
+
+even worse lol, of course.
+
+also, for future reference, /n/vac/archive/2023/0707/usr/glenda/src/clone/.git is legitimately damaged on disk. both venti and neoventi yield the same error, e.g. 
+
+walk: couldn't open /n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9: file does not exist: '/n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9'
+
+Going to need a more efficient seen() implementation?
+
+Two easy ideas: switching it to Fibonacci hashing for picking a cache bucket, and bumping the number of buckets from 256 to e.g. 2048...
+
+Interrupted that walk at 300s lol. Much worse without seen() :P
+
+Testing with 2048 buckets... ~16.4s haha, that's almost a 40% reduction >_<
+
+Obvious question: what about 16384 buckets? lol
+
+also, of note: this cut the warm venti time to 19s, too!
+
+64K buckets, neoventi warm: ~15s, too. Dropping that to 16K seems good. Overhead is on the order of eight bytes per bucket, so even 64K buckets is less than 1M of additional RAM usage (64K * 8 bytes = 512K), and - bonus - I suspect that this scales better: a larger number of smaller buckets, with conservative allocation, should end up wasting less space, given that buckets start with no space allocated. That's conjecture though.
+
+Regardless, it's nearly twice as fast!
+
+it also extended neoventi's lead a bit, I think? venti went from 29 to 19, neoventi went from 26 to 15 ;P
+
+walk uses `dir->qid.path^dir->qid.vers` as the key to the cache table. That's... I have questions about how uniformly distributed that is. let's try Fibonaccizing it.
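+
+Concretely, something like this - a sketch, not walk's actual code; the bucket count and names are invented, only the key expression above is real:
+
+#include <u.h>
+#include <libc.h>
+
+enum { NBUCKET = 8192 };	/* power of two, so the shift below lines up */
+
+uint
+bucket(Dir *d)
+{
+	uvlong key;
+
+	key = d->qid.path ^ d->qid.vers;	/* walk's existing key */
+	/* Fibonacci hashing: multiply by 2^64/phi and keep the top
+	 * log2(NBUCKET) bits; this scrambles weakly-distributed keys
+	 * far better than key % NBUCKET does. */
+	return (key * 0x9E3779B97F4A7C15ULL) >> (64 - 13);	/* 2^13 == NBUCKET */
+}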
+
+_crosses fingers..._
+
+With 4096 buckets, it's... still the same, just worse. Neato.
+
+8K buckets seems sufficient.
+
+Could fuck with bucket growth rate, or initial capacity? (to reduce reallocs/memmoves)
+
+no clear impact, which makes sense; 95% of runtime is in seen() itself, not memory management / shuffling.
+
+Could try caching just by QID, and then subcaching by version? An extra layer of indirection, but fewer comparisons, in principle.
+
+Actually, let's just try removing version from bucket selection...
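+
+Something like this, maybe - again a sketch with invented names, reusing the Fibonacci multiplier from above. vers (plus type and dev, which I'm assuming the cache also checks) only matters for the final comparison:
+
+#include <u.h>
+#include <libc.h>
+
+enum { NBUCKET = 8192 };
+
+typedef struct Entry Entry;
+typedef struct Bucket Bucket;
+struct Entry {
+	Qid qid;
+	ushort type;
+	uint dev;
+};
+struct Bucket {
+	Entry *e;
+	int n;
+	int max;
+};
+
+static Bucket buckets[NBUCKET];
+
+/* returns 1 if d was seen before; records it otherwise */
+int
+seen(Dir *d)
+{
+	Bucket *b;
+	int i;
+
+	/* bucket by path alone: version churn can no longer scatter
+	 * one file across the whole table */
+	b = &buckets[(d->qid.path * 0x9E3779B97F4A7C15ULL) >> (64 - 13)];
+	for(i = 0; i < b->n; i++)
+		if(b->e[i].qid.path == d->qid.path
+		&& b->e[i].qid.vers == d->qid.vers
+		&& b->e[i].type == d->type
+		&& b->e[i].dev == d->dev)
+			return 1;
+	if(b->n == b->max){
+		b->max = b->max ? b->max*2 : 8;
+		b->e = realloc(b->e, b->max*sizeof(Entry));
+		if(b->e == nil)
+			sysfatal("out of memory");
+	}
+	b->e[b->n].qid = d->qid;
+	b->e[b->n].type = d->type;
+	b->e[b->n].dev = d->dev;
+	b->n++;
+	return 0;
+}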
+
+...also, oops, I may have fucked up networking badly by unplugging the phone while usb/ether was active?? looks like that may be causing vacfs to hang??
+
+aux/dial is hanging dialing 127.1!14011, I think it may be trying to use the nonexistent phone for it :/
+
+also fossil might be hanging. um.
+
+I ran a `fsys main sync` and... it can't reach venti, maybe???
+
+home fs is fine. I'm going to save this and shut down. Well, after a minute. Can't run a proper sync :(
+
+Sanity test: is everything good?
+
+% sync
+main...
+prompt: srv: already serving 'fscons'
+srv -A boot
+prompt: archive vac:10d05a5ed61e46b309dc374daaccc362dbb48783
+archive vac:c28717a76c153eed834d684c7eff35814cb25b4c
+fsys main sync
+	main sync: wrote 2 blocks
+done
+home...
+sys home sync
+	home sync: wrote 1 blocks
+done
+
+Okay, cool.
+
+% mk fg
+./7.neoventi -c /dev/jasnah/arenas
+Initializing neoventi build 5... TODO: validate initial stateinitialized, launching server...overridding tcp address for simultaneous testing! tcp!*!14011...
+
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+
+Got four walks running and it seems all fine. CPU load is at maybe 50%? Should be better, could be a lot worse.
+
+Should import ori's walk patch.
+
+(All four walks finished without issue :)
+
+Walk seems to be using ~1s of CPU time at the moment. I suspect that'll be even better with ori's approach, since it'll reduce the number of entries we're keeping; we also won't need so many buckets there except when we _want_ full deduplication.
+
+total: 990
+TEXT 00010000
+    ms      %   sym
+   380	 38.3	seen
+   120	 12.1	convM2D
+    80	  8.0	memmove
+    40	  4.0	pread
+    40	  4.0	blockcheck
+    30	  3.0	Bwrite
+    30	  3.0	treesplay
+    20	  2.0	poolreallocl
+    20	  2.0	dofile
+    20	  2.0	poolallocl
+    20	  2.0	open
+    20	  2.0	close
+    20	  2.0	dirpackage
+    20	  2.0	strchr
+    20	  2.0	strlen
+    10	  1.0	strcmp
+    10	  1.0	s_newalloc
+    10	  1.0	memset
+    10	  1.0	_tas
+    10	  1.0	walk
+    10	  1.0	pooladd
+    10	  1.0	pooldel
+    10	  1.0	poolcompactl
+    10	  1.0	s_free
+    10	  1.0	blocksetdsize
+    10	  1.0	dsize2bsize
+
+Okay, what about vacfs?
+
+total: 6340
+TEXT 00010000
+    ms      %   sym
+   670	 10.5	_sha1block
+   290	  4.5	treesplay
+   260	  4.1	blocksetdsize
+   250	  3.9	blockcheck
+   230	  3.6	lock
+   190	  2.9	trim
+   180	  2.8	pooladd
+   170	  2.6	pooldel
+   170	  2.6	memcmp
+   150	  2.3	blockmerge
+   150	  2.3	_tas
+   130	  2.0	memset
+   130	  2.0	vdunpack
+   120	  1.8	dsize2bsize
+   120	  1.8	memmove
+   120	  1.8	poolallocl
+   120	  1.8	vtentryunpack
+   110	  1.7	unlock
+   110	  1.7	poolfree
+   110	  1.7	poolfreel
+    90	  1.4	strchr
+    90	  1.4	plock
+    80	  1.2	vtcacheglobal
+    80	  1.2	D2B
+    80	  1.2	convD2M
+    80	  1.2	treelookupgt
+    80	  1.2	blocksetsize
+    70	  1.1	vtblockput
+    70	  1.1	io
+    60	  0.9	qunlock
+    60	  0.9	mallocz
+    60	  0.9	newfid
+    60	  0.9	punlock
+    50	  0.7	packettrim
+    50	  0.7	pread
+    50	  0.7	vdcopy
+    40	  0.6	vacfilewalk
+    40	  0.6	_barrier
+    40	  0.6	pwrite
+    40	  0.6	vtmalloc
+    40	  0.6	getdsize
+    40	  0.6	stringunpack
+    40	  0.6	upheap
+    40	  0.6	vtfileunlock
+    40	  0.6	convM2S
+    40	  0.6	vtfileblock
+    30	  0.4	poolalloc
+    30	  0.4	vtfilelock2
+    30	  0.4	packetsplit
+    30	  0.4	meunpack
+    30	  0.4	memalloc
+    30	  0.4	setmalloctag
+    30	  0.4	heapdel
+    30	  0.4	vacdirread
+    30	  0.4	pqid
+    20	  0.3	vderead
+    20	  0.3	direntrysize
+    20	  0.3	permf
+    20	  0.3	rwalk
+    20	  0.3	runlock
+    20	  0.3	B2D
+    20	  0.3	mbresize
+    20	  0.3	packetsize
+    20	  0.3	fileloadblock
+    20	  0.3	_vtrecv
+    20	  0.3	checksize
+    20	  0.3	vtcachebumpblock
+    20	  0.3	strcmp
+    20	  0.3	setrealloctag
+    20	  0.3	free
+    20	  0.3	vtparsescore
+    20	  0.3	_vtsend
+    20	  0.3	blockwalk
+    20	  0.3	downheap
+    10	  0.1	vdeclose
+    10	  0.1	read
+    10	  0.1	strlen
+    10	  0.1	vtlog
+    10	  0.1	sha1
+    10	  0.1	vtsend
+    10	  0.1	vdcleanup
+    10	  0.1	vtlogopen
+    10	  0.1	vacfileisdir
+    10	  0.1	fileopensource
+    10	  0.1	malloc
+    10	  0.1	vacstat
+    10	  0.1	ropen
+    10	  0.1	fullrune
+    10	  0.1	chksource
+    10	  0.1	vacfilegetmcount
+    10	  0.1	filemetalock
+    10	  0.1	vtstrdup
+    10	  0.1	filelock
+    10	  0.1	vtblockduplock
+    10	  0.1	vtfcallfmt
+    10	  0.1	panicblock
+    10	  0.1	wlock
+    10	  0.1	memfree
+    10	  0.1	sizetodepth
+    10	  0.1	fileload
+    10	  0.1	vtfiletruncate
+    10	  0.1	shrinkdepth
+    10	  0.1	mkindices
+    10	  0.1	wunlock
+    10	  0.1	vtblockwrite
+    10	  0.1	gstring
+    10	  0.1	getcallerpc
+    10	  0.1	packetfragments
+    10	  0.1	fragalloc
+    10	  0.1	canlock
+    10	  0.1	packetappend
+    10	  0.1	sizeS2M
+    10	  0.1	convS2M
+    10	  0.1	_vtrpc
+
+Six seconds in vacfs, yikes. Neoventi?
+
+total: 820
+TEXT 00010000
+    ms      %   sym
+   590	 71.9	unwhack
+   110	 13.4	aindexfromaddr
+    40	  4.8	memcpy
+    30	  3.6	pread
+    20	  2.4	vtreadlookup
+    10	  1.2	vtread
+    10	  1.2	vtreadarenablock
+    10	  1.2	_tas
+
+0.8s, of which most is decompression, but... why is pread in there? Everything is definitely in cache, there's no eviction, we shouldn't be touching disk!!
+
+total: 1950
+TEXT 00010000
+    ms      %   sym
+  1440	 73.8	unwhack
+   160	  8.2	aindexfromaddr
+   120	  6.1	memcpy
+    40	  2.0	vtreadlookup
+    40	  2.0	_tas
+    40	  2.0	memcmp
+    40	  2.0	pread
+    10	  0.5	abort
+    10	  0.5	vtread
+    10	  0.5	vtreadarenablock
+    10	  0.5	vtreadarena
+    10	  0.5	pwrite
+    10	  0.5	trybucket
+    10	  0.5	readclump
+
+ahh! read() calls pread() :P so that's just the server waiting on the next command from the client :P
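+
+(For reference - quoting from memory, so check /sys/src/libc/9sys/read.c - the libc wrapper really is just this, which is why a blocked read shows up as pread in the profile:)
+
+#include <u.h>
+#include <libc.h>
+
+long
+read(int fd, void *buf, long n)
+{
+	/* offset of -1 means: use, and advance, the channel's offset */
+	return pread(fd, buf, n, -1LL);
+}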
+
+Hm. 8% of that is the linear scan of the arena map; turning that into a hash table would help, too. Caching _after_ decompression would also come in handy. Can we hack that in without adding a second layer of cache? Probably not; that'd mess with the addresses of subsequent blocks. A second layer should be trivial, though. Worst case, cache.c is ~150 lines; another copy is no big deal. Making the cache aware of which blocks are compressed could be even better.
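+
+For the arena map part: the map should already be sorted by address, so even before reaching for a hash table, a binary search would kill the linear scan. A rough sketch with made-up names, not neoventi's actual structures:
+
+#include <u.h>
+#include <libc.h>
+
+typedef struct MapEntry MapEntry;
+struct MapEntry {
+	uvlong start;	/* first address covered by this arena */
+	uvlong end;	/* one past the last address */
+	int index;	/* arena index */
+};
+
+/* find the arena containing addr: O(log n), versus the O(n) scan
+ * showing up as aindexfromaddr in the profile */
+int
+mapfind(MapEntry *map, int nmap, uvlong addr)
+{
+	int lo, hi, mid;
+
+	lo = 0;
+	hi = nmap;
+	while(lo < hi){
+		mid = lo + (hi - lo)/2;
+		if(addr < map[mid].start)
+			hi = mid;
+		else if(addr >= map[mid].end)
+			lo = mid + 1;
+		else
+			return map[mid].index;
+	}
+	return -1;	/* not in any arena */
+}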
+
+Should be able to eliminate at least 0.6s from the neoventi side, but that's... basically nothing.
+
+...performance can wait. This is good, and a 2x improvement's nothing to sneeze at >_<
+
+Cache eviction should be next. Get the read side _fully done_, and I should be able to get fossil running off of it while I implement write support, probably?
--