ref: b3f5bf725a5e93719b34c552a15ae9103548a4c2
parent: a2ffdf6dfea9b0128732ca068a84654328b8a038
author: Noam Preil <noam@pixelhero.dev>
date: Tue Aug 13 11:45:33 EDT 2024
update notebook
--- a/notebook
+++ b/notebook
@@ -1708,3 +1708,416 @@
on the other hand, evictions should be relatively rare, since I plan on S3-FIFOing.
Also yeah that totally fixed it :D
+
+...also I can TOTALLY make walk of /n/vac faster, I just took the wrong approach! I need to rewrite vacfs with a new libventi next ;P
+
+With everything cached on the neoventi side, I'm seeing 26 seconds for a full read??? Coulda sworn it was like five, but that was a different image, I think, and this is the reform instead of the thinkpad - though they're... generally comparable performance-wise, so I'm uncertain. It'll need further testing.
+
+regardless, seeing six seconds of CPU usage there, which can likely be cut down - and I can probably eliminate the network roundtrips with some explicit in-fs prefetching to cut it down even further.
+
+...worth comparing against venti again :P
+
+neoventi: 69s cold, 26s warm.
+venti: 37.17s cold??? it's winning?! 29s warm. okay.
+
+_sob_
+
+oh wait this is off of /root, so it's not fully cold. right.
+
+That's _concerningly_ close. What happened, neoventi?? Is it because reform??
+
+...oh my gosh, this is with profiling enabled :O
+
+...identical numbers without, though. the hell?
+
+This might just be hardware, since it's only using one core? If there are latency bubbles, this might be amplified way beyond...
+
+Of the 69s for a cold read, I'm only seeing 1.06s of CPU time, of which >80% is decompression.
+
+CPU usage on a warm run is basically identical; neoventi only needs one second of CPU time for this. That's not terrible at all.
+
+doesn't explain why a full core is at 100% during this process, though. vacfs isn't using that much either...
+
+probably.
+
+ % tprof 49587
+total: 6420
+
+six seconds. So, between neoventi and vacfs, that's only seven seconds of CPU time?? ...walk itself, maybe??
+
+not going to try to write prof to its ctl quickly enough (though it'd be easy with a while(! ps | awk...) loop), just going to link walk with -p...
+
+might need to tprof anyway; getting a prof error??
+
+% time walk /n/vac > /dev/null
+1 Prof errors
+12.46u 3.62s 27.04r walk /n/vac
+
+a prof... file was generated, but is invalid apparently??
+
+file system isn't even close to full, so that's not it. ah well.
+
+WALKLOOP:
+
+rm /sys/log/walk.prof ; while(! ps | grep -s walk)sleep 0.01; pid=`{ps | grep walk | awk '{print($2)}'};echo profile > /proc/$pid/ctl;while(tprof $pid >> /sys/log/walk.prof){ sleep 0.5}
+
+...um.
+
+total: 12760
+TEXT 00010000
+ ms % sym
+ 11460 89.8 seen
+ 220 1.7 _profin
+ 200 1.5 _profout
+ 110 0.8 _callpc
+ 110 0.8 strchr
+ 80 0.6 convM2D
+ 70 0.5 treesplay
+ 60 0.4 memcpy
+ 40 0.3 _restore
+ 30 0.2 pooldel
+ 30 0.2 blocksetdsize
+ 30 0.2 blockcheck
+ 20 0.1 D2B
+ 20 0.1 treelookupgt
+ 20 0.1 Bputc
+ 20 0.1 poolfreel
+ 20 0.1 dirpackage
+ 20 0.1 unlock
+ 20 0.1 walk
+ 20 0.1 statcheck
+ 20 0.1 close
+ 10 0.0 pooladd
+ 10 0.0 dofile
+ 10 0.0 memset
+ 10 0.0 canlock
+ 10 0.0 trim
+ 10 0.0 punlock
+ 10 0.0 open
+ 10 0.0 create
+ 10 0.0 dirread
+ 10 0.0 plock
+ 10 0.0 fullrune
+ 10 0.0 poolallocl
+ 10 0.0 memccpy
+ 10 0.0 pread
+
+walk is ~13s of CPU time?!
+
+oh right wait that's with -p lol
+
+ummmm
+
+total: 12130
+TEXT 00010000
+ ms % sym
+ 11530 95.0 seen
+ 120 0.9 memmove
+ 70 0.5 convM2D
+ 50 0.4 blockcheck
+ 40 0.3 dirpackage
+ 40 0.3 walk
+ 40 0.3 treesplay
+ 20 0.1 getdsize
+ 20 0.1 poolfreel
+ 20 0.1 blocksetdsize
+ 20 0.1 pooldel
+ 20 0.1 treelookupgt
+ 20 0.1 strchr
+ 20 0.1 s_append
+ 10 0.0 poolalloc
+ 10 0.0 s_terminate
+ 10 0.0 Bwrite
+ 10 0.0 s_free
+ 10 0.0 dirread
+ 10 0.0 dsize2bsize
+ 10 0.0 poolrealloc
+ 10 0.0 pread
+ 10 0.0 _tas
+ 10 0.0 blockmerge
+
+okay so the profiler is shockingly efficient, tbh. expected way worse
+
+but also walk is 12s of CPU time?!
+
+What if I rip out that code and have it walk directories even if it's seen them already? Dangerous for recursive hierarchies, but probably fine for this...
+
+even worse lol, of course.
+
+also, for future reference, /n/vac/archive/2023/0707/usr/glenda/src/clone/.git is legitimately damaged on disk. both venti and neoventi yield the same error, e.g.
+
+walk: couldn't open /n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9: file does not exist: '/n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9'
+
+Going to need a more efficient seen() implementation?
+
+Two easy ideas: switching it to fibonacci hashing for picking a cache bucket, and bumping the number of buckets from 256 to e.g. 2048...
+
+Interrupted that walk at 300s lol. Much worse without seen() :P
+
+Testing with 2048 buckets... ~16.4s haha, that's almost a 50% reduction >_<
+
+Obvious question: what about 16384 buckets? lol
+
+also, of note: this cut the warm venti time to 19s, too!
+
+64K buckets, neoventi warm: ~15s, too. Dropping that to 16K seems good. Overhead is on the order of eight bytes per bucket, so this should be less than 1M of additional RAM usage, and - bonus - I suspect that this scales better since a larger number of smaller buckets, with conservative allocation, should end up wasting less space, given that buckets start with no space allocated. That's conjecture though.
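+
+(At ~8 bytes per bucket, 64K buckets is roughly 512K and 16K roughly 128K, so comfortably under 1M either way.)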
+
+Regardless, it's nearly twice as fast!
+
+it also extended neoventi's lead a bit, I think? venti went from 29 to 19, neoventi went from 26 to 15 ;P
+
+walk uses `dir->qid.path^dir->qid.vers` as the key to the cache table. That's... I have questions about how uniformly distributed this will be. Let's try fibonaccizing it.
+
+_crosses fingers..._
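+
+For the record, the bucket selection I have in mind is roughly this (just a sketch, not walk's actual code; the multiplier is 2^64 over the golden ratio, and I'm assuming a power-of-two bucket count like 16K):
+
+uint
+bucketof(uvlong path, ulong vers)
+{
+	uvlong h;
+
+	h = (path ^ vers) * 0x9E3779B97F4A7C15ULL;	/* 2^64 / phi */
+	return h >> (64 - 14);	/* top bits index 2^14 = 16K buckets */
+}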
+
+With 4096 buckets, it's... still the same, just worse. Neato.
+
+8K buckets seems sufficient.
+
+Can fuck with bucket growth rate? or initial capacity? (to reduce reallocs/memmoves)
+
+no clear impact, which makes sense; 95% of runtime is in seen() itself, not memory management / shuffling.
+
+Could try caching just by QID, and then subcaching by version? An extra layer of indirection, but fewer comparisons, in principle.
+
+Actually, let's just try removing version from bucket selection...
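+
+(concretely, something like h = dir->qid.path * 0x9E3779B97F4A7C15ULL for picking the bucket, while still comparing vers against the entries inside it)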
+
+...also, oops, may have fucked up networking badly by unplugging phone while usb/ether was active?? looks like that may be causing vacfs to hang??
+
+aux/dial is hanging dialing 127.1!14011, I think it may be trying to use the nonexistent phone for it :/
+
+also fossil might be hanging. um.
+
+I ran a `fsys main sync` and... it can't reach venti, maybe???
+
+home fs is fine. I'm going to save this and shut down. Well, after a minute. Can't run a proper sync :(
+
+Sanity test: is everything good?
+
+% sync
+main...
+prompt: srv: already serving 'fscons'
+srv -A boot
+prompt: archive vac:10d05a5ed61e46b309dc374daaccc362dbb48783
+archive vac:c28717a76c153eed834d684c7eff35814cb25b4c
+fsys main sync
+ main sync: wrote 2 blocks
+done
+home...
+sys home sync
+ home sync: wrote 1 blocks
+done
+
+Okay, cool.
+
+% mk fg
+./7.neoventi -c /dev/jasnah/arenas
+Initializing neoventi build 5... TODO: validate initial stateinitialized, launching server...overridding tcp address for simultaneous testing! tcp!*!14011...
+
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+% mk vac && walk /n/vac
+
+Got four walks running and it seems all fine. CPU load is at maybe 50%? Should be better, could be a lot worse.
+
+Should import ori's walk patch.
+
+(All four walks finished without issue :)
+
+Walk seems to be using ~1s of CPU time at the moment. I suspect that'll be even better with ori's approach, since it'll reduce the number of entries we're keeping; we also won't need so many buckets there except when we _want_ full deduplication.
+
+total: 990
+TEXT 00010000
+ ms % sym
+ 380 38.3 seen
+ 120 12.1 convM2D
+ 80 8.0 memmove
+ 40 4.0 pread
+ 40 4.0 blockcheck
+ 30 3.0 Bwrite
+ 30 3.0 treesplay
+ 20 2.0 poolreallocl
+ 20 2.0 dofile
+ 20 2.0 poolallocl
+ 20 2.0 open
+ 20 2.0 close
+ 20 2.0 dirpackage
+ 20 2.0 strchr
+ 20 2.0 strlen
+ 10 1.0 strcmp
+ 10 1.0 s_newalloc
+ 10 1.0 memset
+ 10 1.0 _tas
+ 10 1.0 walk
+ 10 1.0 pooladd
+ 10 1.0 pooldel
+ 10 1.0 poolcompactl
+ 10 1.0 s_free
+ 10 1.0 blocksetdsize
+ 10 1.0 dsize2bsize
+
+Okay, what about vacfs?
+
+total: 6340
+TEXT 00010000
+ ms % sym
+ 670 10.5 _sha1block
+ 290 4.5 treesplay
+ 260 4.1 blocksetdsize
+ 250 3.9 blockcheck
+ 230 3.6 lock
+ 190 2.9 trim
+ 180 2.8 pooladd
+ 170 2.6 pooldel
+ 170 2.6 memcmp
+ 150 2.3 blockmerge
+ 150 2.3 _tas
+ 130 2.0 memset
+ 130 2.0 vdunpack
+ 120 1.8 dsize2bsize
+ 120 1.8 memmove
+ 120 1.8 poolallocl
+ 120 1.8 vtentryunpack
+ 110 1.7 unlock
+ 110 1.7 poolfree
+ 110 1.7 poolfreel
+ 90 1.4 strchr
+ 90 1.4 plock
+ 80 1.2 vtcacheglobal
+ 80 1.2 D2B
+ 80 1.2 convD2M
+ 80 1.2 treelookupgt
+ 80 1.2 blocksetsize
+ 70 1.1 vtblockput
+ 70 1.1 io
+ 60 0.9 qunlock
+ 60 0.9 mallocz
+ 60 0.9 newfid
+ 60 0.9 punlock
+ 50 0.7 packettrim
+ 50 0.7 pread
+ 50 0.7 vdcopy
+ 40 0.6 vacfilewalk
+ 40 0.6 _barrier
+ 40 0.6 pwrite
+ 40 0.6 vtmalloc
+ 40 0.6 getdsize
+ 40 0.6 stringunpack
+ 40 0.6 upheap
+ 40 0.6 vtfileunlock
+ 40 0.6 convM2S
+ 40 0.6 vtfileblock
+ 30 0.4 poolalloc
+ 30 0.4 vtfilelock2
+ 30 0.4 packetsplit
+ 30 0.4 meunpack
+ 30 0.4 memalloc
+ 30 0.4 setmalloctag
+ 30 0.4 heapdel
+ 30 0.4 vacdirread
+ 30 0.4 pqid
+ 20 0.3 vderead
+ 20 0.3 direntrysize
+ 20 0.3 permf
+ 20 0.3 rwalk
+ 20 0.3 runlock
+ 20 0.3 B2D
+ 20 0.3 mbresize
+ 20 0.3 packetsize
+ 20 0.3 fileloadblock
+ 20 0.3 _vtrecv
+ 20 0.3 checksize
+ 20 0.3 vtcachebumpblock
+ 20 0.3 strcmp
+ 20 0.3 setrealloctag
+ 20 0.3 free
+ 20 0.3 vtparsescore
+ 20 0.3 _vtsend
+ 20 0.3 blockwalk
+ 20 0.3 downheap
+ 10 0.1 vdeclose
+ 10 0.1 read
+ 10 0.1 strlen
+ 10 0.1 vtlog
+ 10 0.1 sha1
+ 10 0.1 vtsend
+ 10 0.1 vdcleanup
+ 10 0.1 vtlogopen
+ 10 0.1 vacfileisdir
+ 10 0.1 fileopensource
+ 10 0.1 malloc
+ 10 0.1 vacstat
+ 10 0.1 ropen
+ 10 0.1 fullrune
+ 10 0.1 chksource
+ 10 0.1 vacfilegetmcount
+ 10 0.1 filemetalock
+ 10 0.1 vtstrdup
+ 10 0.1 filelock
+ 10 0.1 vtblockduplock
+ 10 0.1 vtfcallfmt
+ 10 0.1 panicblock
+ 10 0.1 wlock
+ 10 0.1 memfree
+ 10 0.1 sizetodepth
+ 10 0.1 fileload
+ 10 0.1 vtfiletruncate
+ 10 0.1 shrinkdepth
+ 10 0.1 mkindices
+ 10 0.1 wunlock
+ 10 0.1 vtblockwrite
+ 10 0.1 gstring
+ 10 0.1 getcallerpc
+ 10 0.1 packetfragments
+ 10 0.1 fragalloc
+ 10 0.1 canlock
+ 10 0.1 packetappend
+ 10 0.1 sizeS2M
+ 10 0.1 convS2M
+ 10 0.1 _vtrpc
+
+Six seconds in vacfs, yikes. Neoventi?
+
+total: 820
+TEXT 00010000
+ ms % sym
+ 590 71.9 unwhack
+ 110 13.4 aindexfromaddr
+ 40 4.8 memcpy
+ 30 3.6 pread
+ 20 2.4 vtreadlookup
+ 10 1.2 vtread
+ 10 1.2 vtreadarenablock
+ 10 1.2 _tas
+
+0.8s, of which most is decompression, but... why is pread in there? Everything is definitely in cache, there's no eviction, we shouldn't be touching disk!!
+
+total: 1950
+TEXT 00010000
+ ms % sym
+ 1440 73.8 unwhack
+ 160 8.2 aindexfromaddr
+ 120 6.1 memcpy
+ 40 2.0 vtreadlookup
+ 40 2.0 _tas
+ 40 2.0 memcmp
+ 40 2.0 pread
+ 10 0.5 abort
+ 10 0.5 vtread
+ 10 0.5 vtreadarenablock
+ 10 0.5 vtreadarena
+ 10 0.5 pwrite
+ 10 0.5 trybucket
+ 10 0.5 readclump
+
+ahh! read() calls pread() :P that's waiting on the command from the client, then :P
+
+Hm. 8% of that is also the linear scan of the arena map. Turning that into a hash table would help, too. Caching _after_ decompression would come in handy, too. Can we hack that, without adding a second layer of cache? Probably not; that'd mess with addresses of subsequent blocks. A second layer should be trivial, though. Worst case, cache.c is ~150 lines; another copy is no big deal. Making it aware of which blocks are compressed could be even better.
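+
+Rough shape of a second, post-decompression layer, just to sketch it out (all of these names are made up; the real cache.c interface is different):
+
+enum { NCBUCKET = 16384 };
+
+typedef struct CClump CClump;
+struct CClump
+{
+	uchar	score[20];	/* venti scores are SHA-1 */
+	u32int	size;
+	uchar	*data;	/* clump contents, already unwhacked */
+	CClump	*next;
+};
+
+static CClump *cbucket[NCBUCKET];
+
+static uint
+cindex(uchar *score)
+{
+	/* scores are hashes already, so the leading bytes are uniform enough */
+	return (score[0]<<8 | score[1]) & (NCBUCKET-1);
+}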
+
+Should be able to eliminate at least 0.6s from the neoventi side, but that's... basically nothing.
+
+...performance can wait. This is good, and a 2x improvement's nothing to sneeze at >_<
+
+Cache eviction should be next. Get the read side _fully done_, and I should be able to get fossil running off of it while I implement write support, probably?
--
⑨