ref: b3f5bf725a5e93719b34c552a15ae9103548a4c2
parent: a2ffdf6dfea9b0128732ca068a84654328b8a038
author: Noam Preil <noam@pixelhero.dev>
date: Tue Aug 13 11:45:33 EDT 2024
update notebook
--- a/notebook
+++ b/notebook
@@ -1708,3 +1708,416 @@
on the other hand, evictions should be relatively rare, since I plan on S3-FIFOing.
Also yeah that totally fixed it :D
+
+...also I can TOTALLY make walk of /n/vac faster, I just took the wrong approach! I need to rewrite vacfs with a new libventi next ;P
+
+With everything cached on the neoventi side, I'm seeing 26 seconds for a full read??? Coulda sworn it was like five, but that was a different image, I think, and this is the reform instead of the thinkpad - though they're... generally comparable performance-wise, so I'm uncertain. It'll need further testing.
+
+regardless, seeing six seconds of CPU usage there, which can likely be cut down - and I can probably eliminate the network roundtrips with some explicit in-fs prefetching to cut it down even further.
+
+...worth comparing against venti again :P
+
+neoventi: 69s cold, 26s warm.
+venti: 37.17s cold??? it's winning?! 29s warm. okay.
+
+_sob_
+
+oh wait this is off of /root, so it's not fully cold. right.
+
+That's _concerningly_ close. What happened, neoventi?? Is it because reform??
+
+...oh my gosh, this is with profiling enabled :O
+
+...identical numbers without, though. the hell?
+
+This might just be hardware, since it's only using one core? If there are latency bubbles, this might be amplified way beyond...
+
+Of the 69s for a cold read, I'm only seeing 1.06s of CPU time, of which >80% is decompression.
+
+CPU usage on a warm run is basically identical; neoventi only needs one second of CPU time for this. That's not terrible at all.
+
+doesn't explain why a full core is at 100% during this process, though. vacfs isn't using that much either...
+
+probably.
+
+ % tprof 49587
+total: 6420
+
+six seconds. So, between neoventi and vacfs, that's only seven seconds of CPU time?? ...walk itself, maybe??
+
+not going to try to write prof to its ctl quickly enough (though it'd be easy with a while(! ps | awk...) loop), just going to link walk with -p...
+
+might need to tprof anyway; getting a prof error??
+
+% time walk /n/vac > /dev/null
+1 Prof errors
+12.46u 3.62s 27.04r walk /n/vac
+
+a prof... file was generated, but is invalid apparently??
+
+file system isn't even close to full, so that's not it. ah well.
+
+WALKLOOP:
+
+rm /sys/log/walk.prof ; while(! ps | grep -s walk)sleep 0.01; pid=`{ps | grep walk | awk '{print($2)}'};echo profile > /proc/$pid/ctl;while(tprof $pid >> /sys/log/walk.prof){ sleep 0.5}
+
+...um.
+
+total: 12760
+TEXT 00010000
+ ms % sym
+ 11460 89.8 seen
+ 220 1.7 _profin
+ 200 1.5 _profout
+ 110 0.8 _callpc
+ 110 0.8 strchr
+ 80 0.6 convM2D
+ 70 0.5 treesplay
+ 60 0.4 memcpy
+ 40 0.3 _restore
+ 30 0.2 pooldel
+ 30 0.2 blocksetdsize
+ 30 0.2 blockcheck
+ 20 0.1 D2B
+ 20 0.1 treelookupgt
+ 20 0.1 Bputc
+ 20 0.1 poolfreel
+ 20 0.1 dirpackage
+ 20 0.1 unlock
+ 20 0.1 walk
+ 20 0.1 statcheck
+ 20 0.1 close
+ 10 0.0 pooladd
+ 10 0.0 dofile
+ 10 0.0 memset
+ 10 0.0 canlock
+ 10 0.0 trim
+ 10 0.0 punlock
+ 10 0.0 open
+ 10 0.0 create
+ 10 0.0 dirread
+ 10 0.0 plock
+ 10 0.0 fullrune
+ 10 0.0 poolallocl
+ 10 0.0 memccpy
+ 10 0.0 pread
+
+walk is ~13s of CPU time?!
+
+oh right wait that's with -p lol
+
+ummmm
+
+total: 12130
+TEXT 00010000
+ ms % sym
+ 11530 95.0 seen
+ 120 0.9 memmove
+ 70 0.5 convM2D
+ 50 0.4 blockcheck
+ 40 0.3 dirpackage
+ 40 0.3 walk
+ 40 0.3 treesplay
+ 20 0.1 getdsize
+ 20 0.1 poolfreel
+ 20 0.1 blocksetdsize
+ 20 0.1 pooldel
+ 20 0.1 treelookupgt
+ 20 0.1 strchr
+ 20 0.1 s_append
+ 10 0.0 poolalloc
+ 10 0.0 s_terminate
+ 10 0.0 Bwrite
+ 10 0.0 s_free
+ 10 0.0 dirread
+ 10 0.0 dsize2bsize
+ 10 0.0 poolrealloc
+ 10 0.0 pread
+ 10 0.0 _tas
+ 10 0.0 blockmerge
+
+okay so the profiler is shockingly efficient, tbh. expected way worse
+
+but also walk is 12s of CPU time?!
+
+What if I rip out that code and have it walk directories even if it's seen them already? Dangerous for recursive hierarchies, but probably fine for this...
+
+even worse lol, of course.
+
+also, for future reference, /n/vac/archive/2023/0707/usr/glenda/src/clone/.git is legitimately damaged on disk. both venti and neoventi yield the same error, e.g.
+
+walk: couldn't open /n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9: file does not exist: '/n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9'
+
+Going to need a more efficient seen() implementation?
+
+Two easy ideas: switching it to fibonacci hashing for picking a cache bucket, and bumping the number of buckets from 256 to e.g. 2048...
+
+Interrupted that walk at 300s lol. Much worse without seen() :P
+
+Testing with 2048 buckets... ~16.4s haha, that's almost a 50% reduction >_<
+
+Obvious question: what about 16384 buckets? lol
+
+also, of note: this cut the warm venti time to 19s, too!
+
+64K buckets, neoventi warm: ~15s, too. Dropping that to 16K seems good. Overhead is on the order of eight bytes per bucket, so this should be less than 1M of additional RAM usage, and - bonus - I suspect that this scales better since a larger number of smaller buckets, with conservative allocation, should end up wasting less space, given that buckets start with no space allocated. That's conjecture though.
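+
+(At ~8 bytes per bucket, 64K buckets is roughly 512K and 16K roughly 128K, so comfortably under 1M either way.)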
+
+Regardless, it's nearly twice as fast!
+
+it also extended neoventi's lead a bit, I think? venti went from 29 to 19, neoventi went from 26 to 15 ;P
+
+walk uses `dir->qid.path^dir->qid.vers` as the key to the cache table. That's... I have questions about how uniformly distributed this will be. Let's try fibonaccizing it.
+
+_crosses fingers..._
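+
+For the record, the bucket selection I have in mind is roughly this (just a sketch, not walk's actual code; the multiplier is 2^64 over the golden ratio, and I'm assuming a power-of-two bucket count like 16K):
+
+uint
+bucketof(uvlong path, ulong vers)
+{
+	uvlong h;
+
+	h = (path ^ vers) * 0x9E3779B97F4A7C15ULL;	/* 2^64 / phi */
+	return h >> (64 - 14);	/* top bits index 2^14 = 16K buckets */
+}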
+
+With 4096 buckets, it's... still the same, just worse. Neato.
+
+8K buckets seems sufficient.
+
+Can fuck with bucket growth rate? or initial capacity? (to reduce reallocs/memmoves)
+
+no clear impact, which makes sense; 95% of runtime is in seen() itself, not memory management / shuffling.
+
+Could try caching just by QID, and then subcaching by version? An extra layer of indirection, but fewer comparisons, in principle.
+
+Actually, let's just try removing version from bucket selection...
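+
+(concretely, something like h = dir->qid.path * 0x9E3779B97F4A7C15ULL for picking the bucket, while still comparing vers against the entries inside it)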
+
+...also, oops, may have fucked up networking badly by unplugging phone while usb/ether was active?? looks like that may be causing vacfs to hang??
+
+aux/dial is hanging dialing 127.1!14011, I think it may be trying to use the nonexistent phone for it :/
+
+also fossil might be hanging. um.
+
+I ran a `fsys main sync` and... it can't reach venti, maybe???
+
+home fs is fine. I'm going to save this and shut down. Well, after a minute. Can't run a proper sync :(
+
+Sanity test: is everything good?
+
+% sync
+main...
+prompt: srv: already serving 'fscons'
+srv -A boot
+prompt: archive vac:10d05a5ed61e46b309dc374daaccc362dbb48783
+archive vac:c28717a76c153eed834d684c7eff35814cb25b4c
+fsys main sync
+ main sync: wrote 2 blocks
+done
+home...
+sys home sync
+ home sync: wrote 1 blocks
+done
+
+Okay, cool.
+
+% mk fg
+./7.neoventi -c /dev/jasnah/arenas
+Initializing neoventi build 5... TODO: validate initial stateinitialized, launching server...overridding tcp address for simultaneous testing! tcp!*!14011...
+
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+
+Got four walks running and it seems all fine. CPU load is at maybe 50%? Should be better, could be a lot worse.
+
+Should import ori's walk patch.
+
+(All four walks finished without issue :)
+
+Walk seems to be using ~1s of CPU time at the moment. I suspect that'll be even better with ori's approach, since it'll reduce the number of entries we're keeping; we also won't need so many buckets there except when we _want_ full deduplication.
+
+total: 990
+TEXT 00010000
+ ms % sym
+ 380 38.3 seen
+ 120 12.1 convM2D
+ 80 8.0 memmove
+ 40 4.0 pread
+ 40 4.0 blockcheck
+ 30 3.0 Bwrite
+ 30 3.0 treesplay
+ 20 2.0 poolreallocl
+ 20 2.0 dofile
+ 20 2.0 poolallocl
+ 20 2.0 open
+ 20 2.0 close
+ 20 2.0 dirpackage
+ 20 2.0 strchr
+ 20 2.0 strlen
+ 10 1.0 strcmp
+ 10 1.0 s_newalloc
+ 10 1.0 memset
+ 10 1.0 _tas
+ 10 1.0 walk
+ 10 1.0 pooladd
+ 10 1.0 pooldel
+ 10 1.0 poolcompactl
+ 10 1.0 s_free
+ 10 1.0 blocksetdsize
+ 10 1.0 dsize2bsize
+
+Okay, what about vacfs?
+
+total: 6340
+TEXT 00010000
+ ms % sym
+ 670 10.5 _sha1block
+ 290 4.5 treesplay
+ 260 4.1 blocksetdsize
+ 250 3.9 blockcheck
+ 230 3.6 lock
+ 190 2.9 trim
+ 180 2.8 pooladd
+ 170 2.6 pooldel
+ 170 2.6 memcmp
+ 150 2.3 blockmerge
+ 150 2.3 _tas
+ 130 2.0 memset
+ 130 2.0 vdunpack
+ 120 1.8 dsize2bsize
+ 120 1.8 memmove
+ 120 1.8 poolallocl
+ 120 1.8 vtentryunpack
+ 110 1.7 unlock
+ 110 1.7 poolfree
+ 110 1.7 poolfreel
+ 90 1.4 strchr
+ 90 1.4 plock
+ 80 1.2 vtcacheglobal
+ 80 1.2 D2B
+ 80 1.2 convD2M
+ 80 1.2 treelookupgt
+ 80 1.2 blocksetsize
+ 70 1.1 vtblockput
+ 70 1.1 io
+ 60 0.9 qunlock
+ 60 0.9 mallocz
+ 60 0.9 newfid
+ 60 0.9 punlock
+ 50 0.7 packettrim
+ 50 0.7 pread
+ 50 0.7 vdcopy
+ 40 0.6 vacfilewalk
+ 40 0.6 _barrier
+ 40 0.6 pwrite
+ 40 0.6 vtmalloc
+ 40 0.6 getdsize
+ 40 0.6 stringunpack
+ 40 0.6 upheap
+ 40 0.6 vtfileunlock
+ 40 0.6 convM2S
+ 40 0.6 vtfileblock
+ 30 0.4 poolalloc
+ 30 0.4 vtfilelock2
+ 30 0.4 packetsplit
+ 30 0.4 meunpack
+ 30 0.4 memalloc
+ 30 0.4 setmalloctag
+ 30 0.4 heapdel
+ 30 0.4 vacdirread
+ 30 0.4 pqid
+ 20 0.3 vderead
+ 20 0.3 direntrysize
+ 20 0.3 permf
+ 20 0.3 rwalk
+ 20 0.3 runlock
+ 20 0.3 B2D
+ 20 0.3 mbresize
+ 20 0.3 packetsize
+ 20 0.3 fileloadblock
+ 20 0.3 _vtrecv
+ 20 0.3 checksize
+ 20 0.3 vtcachebumpblock
+ 20 0.3 strcmp
+ 20 0.3 setrealloctag
+ 20 0.3 free
+ 20 0.3 vtparsescore
+ 20 0.3 _vtsend
+ 20 0.3 blockwalk
+ 20 0.3 downheap
+ 10 0.1 vdeclose
+ 10 0.1 read
+ 10 0.1 strlen
+ 10 0.1 vtlog
+ 10 0.1 sha1
+ 10 0.1 vtsend
+ 10 0.1 vdcleanup
+ 10 0.1 vtlogopen
+ 10 0.1 vacfileisdir
+ 10 0.1 fileopensource
+ 10 0.1 malloc
+ 10 0.1 vacstat
+ 10 0.1 ropen
+ 10 0.1 fullrune
+ 10 0.1 chksource
+ 10 0.1 vacfilegetmcount
+ 10 0.1 filemetalock
+ 10 0.1 vtstrdup
+ 10 0.1 filelock
+ 10 0.1 vtblockduplock
+ 10 0.1 vtfcallfmt
+ 10 0.1 panicblock
+ 10 0.1 wlock
+ 10 0.1 memfree
+ 10 0.1 sizetodepth
+ 10 0.1 fileload
+ 10 0.1 vtfiletruncate
+ 10 0.1 shrinkdepth
+ 10 0.1 mkindices
+ 10 0.1 wunlock
+ 10 0.1 vtblockwrite
+ 10 0.1 gstring
+ 10 0.1 getcallerpc
+ 10 0.1 packetfragments
+ 10 0.1 fragalloc
+ 10 0.1 canlock
+ 10 0.1 packetappend
+ 10 0.1 sizeS2M
+ 10 0.1 convS2M
+ 10 0.1 _vtrpc
+
+Six seconds in vacfs, yikes. Neoventi?
+
+total: 820
+TEXT 00010000
+ ms % sym
+ 590 71.9 unwhack
+ 110 13.4 aindexfromaddr
+ 40 4.8 memcpy
+ 30 3.6 pread
+ 20 2.4 vtreadlookup
+ 10 1.2 vtread
+ 10 1.2 vtreadarenablock
+ 10 1.2 _tas
+
+0.8s, of which most is decompression, but... why is pread in there? Everything is definitely in cache, there's no eviction, we shouldn't be touching disk!!
+
+total: 1950
+TEXT 00010000
+ ms % sym
+ 1440 73.8 unwhack
+ 160 8.2 aindexfromaddr
+ 120 6.1 memcpy
+ 40 2.0 vtreadlookup
+ 40 2.0 _tas
+ 40 2.0 memcmp
+ 40 2.0 pread
+ 10 0.5 abort
+ 10 0.5 vtread
+ 10 0.5 vtreadarenablock
+ 10 0.5 vtreadarena
+ 10 0.5 pwrite
+ 10 0.5 trybucket
+ 10 0.5 readclump
+
+ahh! read() calls pread() :P that's waiting on the command from the client, then :P
+
+Hm. 8% of that is also the linear scan of the arena map. Turning that into a hash table would help, too. Caching _after_ decompression would come in handy, too. Can we hack that, without adding a second layer of cache? Probably not; that'd mess with addresses of subsequent blocks. A second layer should be trivial, though. Worst case, cache.c is ~150 lines; another copy is no big deal. Making it aware of which blocks are compressed could be even better.
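+
+Rough shape of a second, post-decompression layer, just to sketch it out (all of these names are made up; the real cache.c interface is different):
+
+enum { NCBUCKET = 16384 };
+
+typedef struct CClump CClump;
+struct CClump
+{
+	uchar	score[20];	/* venti scores are SHA-1 */
+	u32int	size;
+	uchar	*data;	/* clump contents, already unwhacked */
+	CClump	*next;
+};
+
+static CClump *cbucket[NCBUCKET];
+
+static uint
+cindex(uchar *score)
+{
+	/* scores are hashes already, so the leading bytes are uniform enough */
+	return (score[0]<<8 | score[1]) & (NCBUCKET-1);
+}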
+
+Should be able to eliminate at least 0.6s from the neoventi side, but that's... basically nothing.
+
+...performance can wait. This is good, and a 2x improvement's nothing to sneeze at >_<
+
+Cache eviction should be next. Get the read side _fully done_, and I should be able to get fossil running off of it while I implement write support, probably?
--
⑨