Translog node: a "comm hub" that receives log streams (using previously set up accounts and statistical billing) from multiple nodes,
and saves the complete multiplexed stream into a pcap file. This data is not indexed, so it can be written to disk at "wire speed" (~10 Mbyte/sec, much better than the ~150 entries / second achievable with indexing). Very efficient, low cost. It is easy to filter the relevant data during crash recovery. The recovery filter must be an exported interface as well (billed higher than daily storage), so human intervention is only needed on the recovered node, not on the translog nodes the recovery data is fetched from for merging.
- In general, most data is stored on its primary node, where it gets indexed (10 transactions / sec is about 0.1 PC).
- for many transactions, at least 3 indexed operations are performed (market + issuer + wallet, or market + market + wallet, etc...)
- But every data update is also sent to multiple translog nodes, where it is not indexed, just streamed into a large file on disk (10 transactions / sec is about 0.01 PC).
- if data is logged on its way back to the client, it becomes extremely unlikely that something reaches the client yet gets lost in the crashing server.
- every host (that stores important data) should stream its transaction log over the network to geographically separate servers: let's call them "translog" nodes
- it can be a service running on another server
- service running on clients (freenet style)
- or some small HW (like a wifi router, eg. ASUS wl520gu) logging to a USB pendrive or SD-card.
- a pendrive / SD-card connected directly to the server is risky, because lightning or a server power-supply problem can ruin it at the same time as the crash.
- Optical isolators can help, though: expensive for ATM, maybe cheaper for Ethernet, likely very cheap for RS485 (using fast optocouplers and transient-protection diodes), but RS485 speed is a bit too low.
- after a crash (OS, disk, or even total HW destruction), the latest backup AND the transaction log can be used to recover without losing transactions
Expected performance (how many transactions do others generate that we should be able to log?)
In short: up to 10 Mbyte / sec (almost saturating a 100 Mbit/s line). The server should not degrade with the number of connections the way a threaded apache does (see below).
- issuer: 100..1000 transactions / second
- real data (per transaction) can be as little as ~100 bytes, but the filled-in epointcert.txt template is ~800 bytes + DSA signature
- currently < 100 / second, likely because of a poor signature (or decryption) implementation. AMD64 speed is ~1500 RSA1024 /sec or ~4200 DSA1024 /sec, so a 2-core machine with nginx and DSA could end up performing almost 1000 / sec (network + crypto + stream-to-disk). If too much indexing is involved (as in the current implementation), speed can drop below 100 / sec with a simple (eg. RAID-1) disk subsystem.
- data from multiple relays and many-many clients: a server could easily store ~1000 records / second at 5 kbyte / record (wallet data). (A Russian nginx webserver serviced 500 million hits / day; storing 80 million small datablocks without indexing is possible.)
SSD vs spinning disks
transaction speed comparison
It seems reasonable to
- use SSD for the indexed database (issuer and market DB files + indexes).
- SSD gives roughly 3 times better speed per dollar (and this gap will grow): its $/GByte is about 20 times higher (around $2/GByte for SSD vs $0.1/GByte for spinning disks!), but random data access is about 60 times faster than on spinning drives
- use spinning hard drives for transaction logs (the filesystem journal too?), for multiplexing transaction logs from several issuer + market instances, and for regular backups of the SSD
With this approach the CPU doing crypto becomes the bottleneck (1500 /sec), not the disk seeks (300 /sec). Only if the webserver implementation allows it, of course (with apache threads the OS stupidly keeps waking up threads that immediately go back to sleep without doing anything useful).
(Strictly monotonic) Sequence counters
help these operations:
- optional synchronization
- and merging (finding missing chunks and filling them from other translog nodes; a minimal merge sketch follows the next list).
Sequence numbers could be generated by:
- wallet client
- actually the wallet client mostly generates new wallet revisions that it stores on the relay (by index !): latest revision needs to be fetchable quickly
- issuer, market or relay (probably most important for us at this point)
- because for market and issuer we'll use traditional translog methods first.
- the translog server itself, after multiplexing. This would allow discovering missing or truncated logfiles.
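As a minimal sketch (Python, with a hypothetical {seqnr: payload} record layout) of what the merging step looks like: strictly monotonic sequence numbers turn "did we lose anything?" into a mechanical gap check, and missing chunks can be filled from whichever translog node has them.

    # Sketch: merge translog records keyed by a strictly monotonic sequence number.
    # The {seqnr: payload} layout is an assumption for illustration only.
    def merge_translogs(logs):
        """logs: one dict {seqnr: payload} per translog node."""
        merged = {}
        for log in logs:
            for seqnr, payload in log.items():
                merged.setdefault(seqnr, payload)      # first copy of a chunk wins
        if not merged:
            return [], []
        missing = [s for s in range(min(merged), max(merged) + 1) if s not in merged]
        ordered = [merged[s] for s in sorted(merged)]
        return ordered, missing

    # node A lost seqnr 3, node B lost seqnr 5; together they are complete:
    node_a = {1: "t1", 2: "t2", 4: "t4", 5: "t5"}
    node_b = {2: "t2", 3: "t3", 4: "t4"}
    records, holes = merge_translogs([node_a, node_b])
    print(records, holes)    # ['t1', 't2', 't3', 't4', 't5'] []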
- A database (Berkeley DB, postgres, mysql or even a text file) would work equally well for storing the per-logfile metadata:
- storing filename, starting seqnr, recordcount, timestamps, filelength (the other fields describe the file contents up to this filelength)
- It is important that for each logfile these fields are stored within the same 512-byte (not 4096!) block, so the hard drive writes them to disk atomically. For a 1514.dat file, 1514.meta could hold this data (the filename, or anything stored in the directory inode, cannot guarantee this atomicity). A libpcap file that has no such meta (or whose .dat file is actually longer than the meta says) can be re-read and the metadata corrected.
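A minimal sketch of writing such a 1514.meta record (Python; the exact field encoding is an assumption): the record is serialized into well under 512 bytes, padded to exactly one sector, and rewritten in place with a single pwrite(), so the drive can commit it atomically.

    # Sketch: keep the per-logfile metadata inside one 512-byte sector so the
    # drive updates it atomically. Field names follow the list above; the JSON
    # encoding is only an illustration.
    import json, os, time

    META_SIZE = 512   # one physical sector, NOT 4096

    def write_meta(metafile, datafile, start_seqnr, recordcount, filelength):
        record = {
            "filename": datafile,
            "start_seqnr": start_seqnr,
            "recordcount": recordcount,
            "timestamp": int(time.time()),
            "filelength": filelength,   # the other fields describe the file up to here
        }
        blob = json.dumps(record).encode()
        assert len(blob) <= META_SIZE, "meta record must fit in one sector"
        fd = os.open(metafile, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.pwrite(fd, blob.ljust(META_SIZE, b"\0"), 0)   # single in-place write
            os.fsync(fd)
        finally:
            os.close(fd)

    # eg. after appending to 1514.dat:
    # write_meta("1514.meta", "1514.dat", 100700, 1200, os.path.getsize("1514.dat"))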
The binary data (possibly generated on a client) could be prepended with text headers (somewhat like an email envelope) as it travels along the way; key=value lines should work better than free text (a possible envelope shape is sketched after the list below).
Any binary or textual data can be logged via the network this way.
- Bigger data must be segmented; 60..512 kbyte might be a reasonable chunk size (remember that this amount is collected in memory - for each incoming connection - before it starts being written to disk, so 10 Mbyte sounds a bit too high). Within that limit the chunk size could be dynamic.
- Some translog implementations MAY be able to return the last (or last few) sequence numbers received from the same stream
- this could result in more complete logs (a more robust system), and possibly allow UDP, which could improve performance further
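One possible shape for such an envelope (the header names here are only illustrative, not a fixed protocol): a few key=value lines, an empty line, then the binary chunk; every hop can prepend its own headers without touching the payload, and big payloads are split into parts within the chunk-size limit.

    # Sketch: prepend key=value text headers to a binary chunk, email-envelope style.
    def wrap_chunk(payload, **headers):
        head = "".join("%s=%s\n" % (k, v) for k, v in headers.items())
        return head.encode() + b"\n" + payload   # empty line separates headers from data

    payload = b"<binary wallet data, part 2 of 5>"
    chunk = wrap_chunk(payload, source="wallet:alice", seqnr=1042,
                       part="2/5", length=len(payload))
    # a relay or translog node can prepend its own header on the way:
    chunk = b"received_by=relay7\n" + chunk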
Translog nodes: fast logging over the network
- a client saves data ("files") to multiple (usually 2) storage_nodes
- tcp (http-style) is most administrator-friendly
- but UDP is highest performance
- lightweight authentication can be done via a shared secret: by sending ID, sequencecounter++, sha1(sharedsecret . sequencecounter). The receiver verifies that the sequencecounter has increased (no replay) and that the sha1 checksum is correct, then rejects (if it is an attack) or accepts and applies statistical billing (a minimal sketch follows this list).
- sending base64 data via UDP packets to syslogd would be an interesting hack
- each storage_node sends "transaction log" over the network to multiple (usually 3..7) other "translog nodes".
- This is a very cheap way to minimize risk. Cheap, because the translog file only gets written (to a contiguous file, not indexed: very few disk seeks) and very rarely needs to be read back.
- translog servers might only accept connections from certain network addresses (this protection can be set up in the configuration file, and possibly also by the firewall administrator for added protection)
- symmetric session keys might be set up occasionally and used for several log entries to minimize PKI operations.
- Fast Logging Project for Snort (FLoP) has the network infrastructure implemented. Some extra code hacking is required for sessions, and the specialized data.
- a storage_node SHOULD report to the client how many translog copies it succeeded to send (the client MAY visualize this and present it to the user; a green indication COULD be used when 2 storage_nodes were reached with at least 1 and 2 translogs written respectively: that means the info is stored in at least 2+3=5 places).
- When some transactions possibly need recovery (eg. when the server starts - especially after a rude shutdown - but possibly also after an apparently clean exit), data is read from several translog nodes, filtered for relevancy (dropping other nodes' data) and merged, so no transaction is lost even if there are holes in the translog of every translog node.
- This is a very cost-efficient, yet robust mechanism
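A minimal sketch of the shared-secret authentication described above (Python; how the counter is concatenated into the hash, and the names used here, are assumptions):

    # Sketch: sender transmits (ID, seqnr, sha1(sharedsecret . seqnr)); the receiver
    # checks that seqnr increased (no replay) and that the digest matches, then
    # accepts the log entry and applies statistical billing.
    import hashlib

    def make_token(node_id, sharedsecret, seqnr):
        digest = hashlib.sha1(sharedsecret + str(seqnr).encode()).hexdigest()
        return node_id, seqnr, digest

    class TranslogAuth:
        def __init__(self, secrets):
            self.secrets = secrets      # node_id -> sharedsecret
            self.last_seq = {}          # node_id -> highest seqnr seen so far

        def accept(self, node_id, seqnr, digest):
            secret = self.secrets.get(node_id)
            if secret is None or seqnr <= self.last_seq.get(node_id, 0):
                return False            # unknown sender, or replayed/stale counter
            expected = hashlib.sha1(secret + str(seqnr).encode()).hexdigest()
            if digest != expected:
                return False            # wrong shared secret: reject as attack
            self.last_seq[node_id] = seqnr
            return True                 # accept; statistical billing happens here

    auth = TranslogAuth({"issuer1": b"s3cr3t"})
    print(auth.accept(*make_token("issuer1", b"s3cr3t", 1)))   # True
    print(auth.accept(*make_token("issuer1", b"s3cr3t", 1)))   # False (replay)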
hack: the Linux kernel firewall's ULOG target can log entire packets. It would be a giant hack to store the transaction log this way, and perhaps not even robust enough: some packets might be missing (preventing reassembly of the streams). Reimplementing a TCP stack inside the userspace logger would be prohibitive. A modified khttpd or any event-based http server is cheaper. See below.
cleaner solution: hacking a webserver (eg. nginx, thttpd or mathopd) or using CGI to save data in libpcap format. Many tools can read it.
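A minimal sketch of what "saving data in libpcap format" amounts to (Python; the choice of LINKTYPE_USER0 and the record contents are assumptions): a 24-byte global header once per file, then a 16-byte record header before each log entry, and the result stays readable with standard pcap tools.

    # Sketch: append arbitrary log records as "packets" of a libpcap file.
    import os, struct, time

    LINKTYPE_USER0 = 147   # linktype range reserved for private use

    def pcap_global_header(snaplen=65535):
        # magic, version 2.4, thiszone, sigfigs, snaplen, linktype
        return struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, snaplen, LINKTYPE_USER0)

    def pcap_record(payload):
        now = time.time()
        header = struct.pack("<IIII", int(now), int(now % 1 * 1000000),
                             len(payload), len(payload))
        return header + payload

    path = "translog-1514.pcap"
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if new_file:
            f.write(pcap_global_header())
        f.write(pcap_record(b"issuer1 seq=1042 <binary transaction record>"))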
multiplexed log in libpcap format, or something better?
- a single libpcap file by itself is not quite enough:
- imagine that we have a 1 MByte / sec stream (86.4 GBytes / day) coming from 100 nodes, and 1 node needs recovery after a serious HW crash (both drives in its RAID-1 failed within a short period). The full backup is 46 hours (1.9 days) old. The disk bandwidth available for recovery is 10 Mbyte / second, so recovery time is 4.6 hours. With 50 Mbyte / second available (RAID-1, almost full bandwidth in streaming mode) it is still about 1 hour.
- if we have an index file, we can speed recovery up about 46 times (=> ~6 minutes): only every 100th datablock is interesting, and since a record spans 2 blocks on average, we need to read about 1 block in 50. That brings the data to read down to 1/50 * 86400 = 1728 Mbytes, or ~180 seconds per day of log at 10 Mbyte / sec recovery bandwidth - about 6 minutes for the 1.9 days (the arithmetic is recomputed in the sketch after this list). In this mode 50 Mbyte / sec (1.2 minutes) is definitely only possible with expensive hardware, but 10 Mbyte / sec is easy.
- reasonable recovery time even if last full (indexed, ready-to-run) backup is 1 week old. (daily might be better practice though).
- the index file can easily be recreated from the full libpcap file when necessary
- or just the end of the index file, eg. when the index gets truncated because the system went down unexpectedly.
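The recovery-time arithmetic above, recomputed as a quick back-of-the-envelope check:

    # 1 Mbyte/sec total stream, 100 source nodes, last full backup 46 hours old.
    log_to_scan = 1.0 * 46 * 3600                 # Mbytes written since the backup (~165 GB)
    for disk_rate in (10, 50):                    # Mbyte/sec available for recovery reads
        print("full scan at %2d Mbyte/s: %4.1f hours" % (disk_rate, log_to_scan / disk_rate / 3600))
    # with an index: 1 node in 100 is interesting, records span ~2 blocks,
    # so roughly 1 block in 50 must be read
    print("indexed read at 10 Mbyte/s: %.1f minutes" % (log_to_scan / 50 / 10 / 60))
    # -> 4.6 h and 0.9 h for the full scans, ~5.5 minutes with the index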
An alternative to a separate index file: a pointer array that points to some earlier datablocks of the same substream, and to the previous pointer array! For almost no cost, this allows extracting a substream rather quickly by processing the file backwards. (Usually the normal order is needed, so a separate index file - when available - allows faster recovery.)
Demultiplexing: aio documents int lio_listio(int mode, struct aiocb *list[], int nent, struct sigevent *sig); - a very nice function. (A not-so-good first thought: something like pvread, in the vein of pread and readv.)
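For comparison, the same index-driven selective read can be sketched in Python with plain pread() calls (the (offset, length) index format is an assumption); lio_listio() submits the whole batch of reads in a single call, which is what the test program below measures.

    # Sketch: extract one substream from the multiplexed log using an index of
    # (offset, length) pairs; lio_listio() would batch these reads in one call.
    import os

    def extract_substream(path, index, out_path):
        fd = os.open(path, os.O_RDONLY)
        try:
            with open(out_path, "wb") as out:
                for offset, length in index:      # only the crashed node's blocks
                    out.write(os.pread(fd, length, offset))
        finally:
            os.close(fd)

    # eg. recover the crashed issuer's substream from one translog file:
    # extract_substream("translog-1514.pcap", [(24, 4096), (131096, 2048)], "issuer1.rec")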
- cell wrote a small lio_listio() testprogram for speed testing: the traversal speed was 57..60 Mbyte/sec, almost independent of the extracted portion (eg. 1/20 or 1/60 of the file extracted from the multiplexed streams), so the effective extract speed is only 1..3 Mbyte/sec. Not sure if this is due to the 1000 GByte SATA disk or the Linux 2.6.26-2-vserver-amd64 #1 SMP OS.
- depending on how many functions are deployed on a real HW (issuer, market, gateway), usually 4..12 streams (each from one HW) are multiplexed (assuming all data is stored as 1 indexed copy + 4 transaction-log copies for redundancy). This is because when a HW crashes, we can and should recover the transaction-log substreams of all of its services.
- that means ~100 kbyte/second from one HW (~100 transactions per second, or 200 if we only store the real data, not the filled-in templates) is the absolute maximum recommended. Since this is a small country's traffic, for other reasons it's best to spread services a bit more, to make it less appealing for the FED mafia to interfere with any single host.
Extremely poor performance of apache2 (probably of any threaded webserver). List of web servers with reasonable performance
Numbers below are approximate, measured on a 2009 PC: 2 x 1000 MHz AMD64 Athlon processors (or a similar ~3GHz PC), standard Debian config, small static files over a 100 Mbit/sec network:
- Apache2: 1200 / sec (2500 / sec with keepalive)
- most say that apache can service appr 30-60 Mbit/s unless requests are big
- apache with svn behind nginx seems to be a reasonable compromise. Nginx can rate-limit, and protect apache.
- Tomcat + redcent serviced approx. 800 / sec for a simple query (/CERT124.asc, which does not need to create a PKI signature, just returns data from the database). Behind apache2 using mpm-worker, performance dropped to approx. 215 / sec (this is very poor if we consider that apache didn't have to do any disk or database IO!)
- did not try non-threaded apache configuration.
- nginx (pronounced "engine-X"): 5-6000 / sec (9-12000 /sec with keepalive)
- nginx seems to fill 200 Mbit/sec even with relatively small requests. Roughly 4 times the performance of threaded apache2 on average, with lower CPU load and more predictability: less latency deviation.
- serviced 500 million (small) http requests / day on some Russian site. Not bad !
- lighttpd: similar to nginx (popular with Ruby). nginx sourcecode might be a bit cleaner (nginx is originally Russian, but most docs are available in English now).
- thttpd: often used behind nginx (nginx servicing static pages and fast-CGI-s, thttpd servicing old-fashioned CGI-s)
- mathopd: a minimalistic thttpd ? The shortest sourcecode of all ?
- fnord: despite its short sourcecode, it lost connections when stress-tested with ab -n 2000! Didn't investigate further.
- khttpd: kernel httpd is fastest, but originally for static files only. Saving incoming data to log might require some hack.
Slowloris.pl (or ab or httperf or any program that creates many connections) shows that Apache2 degrades gravely (memory, latency deviation, etc... easy to DoS with even just one client host) when the number of connections grows (eg. ab -c 5000), while nginx keeps working extremely well. In theory the threaded implementation (like apache) is "dual" to the event-based implementation (all high-performance webservers), so the functionality and performance of good implementations should be equivalent, but in practice with threads the OS spends most of its time creating threads, context switching and doing related administration.
note for the curious: actually most CPU time is spent in the linux kernel walking linked lists, waking up thousands of threads that immediately go back to sleep (except one). It is hard to write a high-performance threaded network application with strict POSIX compliance.
With Apache2 failing even the most basic tests, cell was afraid that some (epoll or kqueue) code would have to be written from scratch for any network data processor with sane performance (eg. a data multiplexer for backup logs from different sources). Luckily it's already written: nginx looks like a good platform to build on.