Network File Storage and onion (anonymizing gateway) services for
EpointSystem
Important: do NOT think of it as a DHT (distributed hash table) "cloud". That approach has severe performance implications: e.g. 100 vs 1500 file stores per second. Note: while applying high redundancy, we plan to consume only about 1..10% of the IO throughput (and even less CPU load) of the involved servers for the file storage services, and we want a high level of DOS-resistance. So performance matters.
- we use STORE(), FETCH() words instead of PUT(), GET() to emphasize this
3 kinds of indexing (and lookup) are provided by storage nodes:
- handle provided by the server. Useful for logging (certificates, archiving unlikely-to-be-needed data)
- FETCH( <handle provided by the server at STORE time>)
- (alphanumeric) key provided by the client (like "hist/2010-12"). Roughly equivalent to a standard filesystem.
- this is the only one implemented currently, though without billing yet.
- key+(numeric) revision provided by the client. Roughly equivalent to Subversion or the VMS filesystem.
- fetch revision rev=1..max; rev=0 requests the latest (max) revision.
Storage nodes may advertise a different price for the storage and lookup of each class (growing in the listed order). The same STORE() interface can be used; if the revision=... parameter is missing, the request falls into the plain key class.
- the "bootstrap" (fetching the root inode of an arbitrarily large directory tree) is somewhat special - (special anyway because of DOS-resistance related accounting ).
- Currently implemented by key=sha1( passphrase + "save" ); example for passphrase="k" : echo -n "ksave" | sha1sum
f7109868545680a47c4db14630b23a67b588a592
- currently clients do NOT save the revision number. Saving the revision number would be helpful for:
- debugging
- avoiding the accidental case when the wallet gets overwritten by an unusable truncated/crippled file
- synchronization: different revision number files may be found on different servers. Obviously the client does not know the revision number before looking around on the servers (storage nodes). So fetching the highest revision number will be the most common operation; fetching older data is sometimes also useful.
- The client can verify in several ways that the data is proper (MDC message digest, etc.) but the server cannot. Sticking a pubkey for every user on the server, so that at least the revision => digest mapping can be verified on the server (useful when replicating between servers), would have some advantages.
- the client also provides the hash value of the file, so the server is free to use that too. Also useful for verification and future developments. Obviously the client has no way to find this out during bootstrap! So the above method is used for getting the wallet root inode (sketched below).
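A minimal sketch of this bootstrap key derivation (Python; the function name is illustrative, the formula and test vector are from the example above):

    import hashlib

    def bootstrap_key(passphrase: str) -> str:
        # key = sha1( passphrase + "save" ), as in: echo -n "ksave" | sha1sum
        return hashlib.sha1((passphrase + "save").encode()).hexdigest()

    assert bootstrap_key("k") == "f7109868545680a47c4db14630b23a67b588a592"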
Easy setup, low startup costs and high redundancy, preferably from the start => we want this (redundant) service to be deployed as widely as possible.
- Applet => webservice or servlet => database
- Applet => perl or C or php => database
- Besides a JAVA server, this should be possible through traditional hostings like hostmonster.com (bluehost) that provide high-volume hosting (>300 GByte storage and >1000 GByte / month traffic) but do not support JAVA (for a reason).
- However, on such a host the solution need not be sized to insert more than 10 documents per second. We expect most groups to contract with some professionals, but still use similar hostings (also useful for their other webhosting tasks, wikis, forums, etc.) as an "own backup, if all else fails".
Expected benchmark results (on a vanilla PC in 2009, when fully exhausted by these transactions during stress tests: ~100+ times higher load than the normal production traffic we expect in the near future):
- We expect to be able to insert 10..50 documents and (simultaneously) fetch 5 documents per second on the storage server without prohibitive server load (possibly on the same HW where the issuer is running, doing 60+ CPU-intensive issuer transactions per second)
- note that we expect MANY more inserts than retrievals: because of the redundant backup usage of this storage network, one wallet might choose to store historical logs at 5 different places. Some data might never actually be needed (or might only be queried for verification, to hunt down misbehaving storage servers). Think about expired epoint RANDs, old history, etc.: data you might want to be able to retrieve if something goes wrong, but normally just not needed.
- We expect a typical document size of ~2 kbyte .. 100 kbyte. The (encrypted) certificates are rather short. The list of keys that allows retrieving a bigger file in multiple steps (similar to how .torrents work) is also usually short.
- small files are expected during normal transactions (saved to typically 1 node if losing them is no big deal, like when creating a new, empty epoint account for future use; saved to 2..3 nodes if losing them would result in losing epoints)
- Occasional bigger files, like the concatenation of a set of docs, possibly when "logging out" at the end of the session (possibly saved to 4+ nodes), should be no problem either.
Quick data indexing
See how Kevin reached 20000+ inserts / second using mysql LOAD DATA INFILE, with a sweet spot around 1000-record batches. He used small records. Obviously 2+ kbyte records would result in lower throughput and a lower optimal batch size.
Mind-blowing 62949 inserts / second using mysql (restore from dump): 2 million records restored in 31.772 seconds with zcat dump.sql.gz | mysql. During this operation only the primary key index is built. With the same schema, Marcell achieved 48000 inserts / sec (500000 records loaded + indexed in < 10.5 seconds), and surprisingly SELECT COUNT(*), SUM(id) FROM weightin WHERE date >= '2001-01-10' AND date < '2001-12-22'; returns immediately after the loading is ready (although only "id" is made a key; maybe mysql also adds an index on the date field?? Or it just quickly scans the 500000 records in memory and finds the 89758 matching records). Note that we expect indexing performance to degrade severely before 3 million records (120 MByte index file ~ 120 GByte data area) are reached. The solution to scale is to partition according to storage time: that is, use (storagetime, hashvalue) as a compound key for retrieval (besides the hashvalue, also store the storagetime in the wallet). So a wallet "directory entry" contains (for each file in the "directory"; see the sketch after the list):
- filetype (eg. history)
- interval: 2009-11-01 .. 2009-12-01
- hashvalue=fa97....3e (SHA1)
- for each replica: node, storagetime
- node1 (fingerprint + url ?), storagetime
- node2 (fingerprint + url ?), storagetime
- ...
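A sketch of one such directory entry as a record (Python; field names are illustrative):

    from dataclasses import dataclass

    @dataclass
    class Replica:
        node: str          # fingerprint + url of the storage node
        storagetime: str   # stored in the wallet: one half of the compound key

    @dataclass
    class DirEntry:
        filetype: str      # e.g. "history"
        interval: str      # e.g. "2009-11-01 .. 2009-12-01"
        hashvalue: str     # SHA1 of the stored (encrypted) file
        replicas: list     # list of Replica: where and when it was stored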
Summary (for a peak-performance "mostly-write" hashtable):
- use a daemon to receive data. The daemon writes to 2 files:
- it concatenates the data into the data file. No need for gzip, as we usually store encrypted data anyway. Start a new file after reasonable chunks like 50..100 MByte.
- it writes the data hash, length and other metadata into a "mysqldump" file, in batches of 5000 to 100000 records (perhaps depending on hits / sec), then loads this into mysql. Note that all SELECT queries block during this load, so try to keep it < 0.4 second for light-load servers and < 0.8 second for high-load servers.
- Both can be replicated to several hosts (or "backup" storage) with rsync.
- Perhaps keep 2 sets of datafiles for the guaranteed storage, depending on whether a storage certificate was requested or not. The database can be the same of course.
- Once in 2..3 years, when migrating to new hardware anyway, the data can be filtered before piping to mysql so expired records are discarded.
- Separate datafile and database for the non-guaranteed service - these are always separate.
- why not just insert into mysql? mysql (or postgresql) will write the data and create the index as necessary - but per-record inserts pay the index-update disk seeks discussed below, so batched loading wins by more than an order of magnitude. A sketch of the daemon loop follows this list.
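A sketch of that daemon loop (Python; table and file names are illustrative, and a production version would write a mysqldump/LOAD DATA INFILE batch file instead of building an INSERT string):

    import hashlib, subprocess

    BATCH = 5000              # 5000..100000 records per load, tuned to hits/sec

    def store_daemon(incoming, datafile="data.0001", db="storagedb"):
        """Append blobs to the datafile; batch-load metadata into mysql."""
        data = open(datafile, "ab")
        batch = []
        for blob in incoming:                  # blobs from STORE() requests
            offset = data.tell()
            data.write(blob)                   # concatenate; no gzip (already encrypted)
            digest = hashlib.sha1(blob).hexdigest()
            batch.append("('%s',%d,%d)" % (digest, offset, len(blob)))
            if len(batch) >= BATCH:
                data.flush()                   # data must hit disk before it is indexed
                sql = ("INSERT INTO inodes (hash, offset, length) VALUES "
                       + ",".join(batch) + ";")
                # SELECTs block while this loads: keep it < 0.4..0.8 second
                subprocess.run(["mysql", db], input=sql.encode(), check=True)
                batch = []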
You might wonder why performance matters at all. We want to be able to deploy the "secondary backup service, just in case" widely, including on shared web-hosting, where low server load is an essential feature for this type of application (hosts penalize high load). 1% load per host should be a reasonable figure for a "high traffic" storage node. This way we can maintain easy setup, low startup costs and high redundancy, preferably from the start.
STORE and FETCH requests are best made via http://, using the "shared secret symmkey" specified at TOPUP time.
https:// is possible, but https allows a very cheap DOS attack where the server has to do a lot of PK calculations without much effort on the attacker's side.
- using prepaid accounts, previously topped up with TOPUP( accountID, epointRAND, symmkey ), where symmkey is a shared secret that provides the secret communication channel for later requests of the given accountID.
- before the TOPUP happens (in which the data is PK-encrypted inside the request, so it is secure even without the DOS-prone https), the server sends a challenge to the client (no secure channel is needed for this). The challenge can be something like hashcash: e.g. reversing a dozen 18..20 bit hashes.
- The server only begins any costly PK calculations if the client meets the challenge. This way an attacking client needs ~100 times more CPU resources than the server if it wants the server to perform CPU-eating PK operations.
- hashcash is only useful for TOPUP, where the server must do PK calculations. For normal STORE, FETCH requests (99% of cases) the server only applies CPU-cheap symmetric decryption, so hashcash is not needed. (A sketch of the challenge and of the request wrapping follows this list.)
- STORE( accountID, symmencrypt( symmkey, wrap(key, value, expiry)) ) is the guaranteed service
- an alphanumeric handle is returned in the response. The handle is needed for the FETCH operation.
- a certificate is NOT returned in the response
- Typical expiry time is 6-11 years (usually >5 years for financially sensitive information). It might be less than 5 years if we (the wallet, as no one else understands the encrypted data!) consolidate from time to time (throw away anything certainly not needed and preserve anything that might actually be needed later).
- RSA or other PK - even just for the session setup - would provide a cheap DOS method. TOPUP happens anyway, so specifying the symmkey there and using symmetric encryption here is straightforward.
- REQCERT( accountID, symmencrypt( symmkey, wrap(key, expiry)) ). Returns a certificate stating that we guarantee storage of the data with this digest (key) until expiry. Can be queried min 10 (or 30?) seconds after the STORE (of the same key). This way the storage node only signs certificates for data already residing in safe storage (=> lower chance of handing out a CERT right before an accidental shutdown, when we cannot guarantee safe storage of all the data poured in during the last couple of seconds).
- FETCH( accountID, handle, key ) is a cheap "GET" method (recommended). This is guaranteed for any file for which a storage certificate was issued upon REQCERT.
- FETCHROOT( accountID, revision ) is the method to fetch a certain revision of the root inode
- use revision=0 for latest revision (revisions are numbered from 1)
- any files or subdirectories can be queried with FETCH, using content from this file.
- A client might choose to store the wallet only in this FETCHROOT document and might never use FETCH. But we recommend that clients store "unlikely to be needed" historic logs under subdirs, thereby not bloating the root inode.
- GET( accountID, key ) is an expensive GET method (not recommended; MAY NOT be supported by the server - avoid when possible). The server can only service a GET without a specified handle for files that have been indexed in the big index.
- optional: big index (where the primary key is not the handle: the hashkey is enough by itself to retrieve the data): the server builds the big index in memory (containing all keys: e.g. it merges the new keyfiles - fed to it since last time - into the previous big index). This happens in memory, without prohibitive disk seeks, and regularly, say monthly. Without this, writing every 40-byte index entry into the huge bigindex would cause a prohibitive, approx. 8+8 msec disk-seek service time (read one block + write one block). Without the storagetime trick, 100 transactions / sec would be the maximum in the end, not 2000+ transactions / sec.
- optional: PUT( key, value, expiry ) is a best-effort service (no guarantee that the data is preserved: in case of flooding DOS attacks some of these might be denied, or the data purged, even intentionally).
- it would be best to have this indistinguishable from the other put (even when using http, not https). This means that TOPUP could be used to set up freelancer accounts when 0 valid epoints are provided, possibly with a second round of even more hashcash so clients don't flood the freelancer account table either (whether it's the same database or a separate one).
- later: ONION( dest, data ) is a public-key encrypted message to the server, which it needs to pass on to the destination after decrypting. Using several layers is possible.
- placing hooks on the way to find a backroute for the related response messages (without the server knowing where the client connected from!).
- See Tor and mixmaster, mixminion architecture
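Putting the pieces above together, a minimal sketch (Python) of the hashcash-style challenge and of wrapping a guaranteed STORE() request. The 20-bit difficulty, the JSON framing and the AES-GCM cipher (from the third-party cryptography package) are illustrative assumptions - the notes do not fix a concrete cipher:

    import hashlib, json, os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # assumed cipher

    BITS = 20  # puzzle difficulty: ~2**20 trials for the client, 1 hash for the server

    def challenge_met(nonce: bytes, answer: int) -> bool:
        d = hashlib.sha1(nonce + str(answer).encode()).digest()
        return int.from_bytes(d[:4], "big") >> (32 - BITS) == 0

    def solve_challenge(nonce: bytes) -> int:
        # client side: brute force before the server does any PK work (TOPUP only)
        answer = 0
        while not challenge_met(nonce, answer):
            answer += 1
        return answer

    def store_request(accountID: str, symmkey: bytes, key: str,
                      value: bytes, expiry: str) -> dict:
        # STORE( accountID, symmencrypt( symmkey, wrap(key, value, expiry) ) )
        wrapped = json.dumps({"key": key, "value": value.hex(),
                              "expiry": expiry}).encode()
        iv = os.urandom(12)  # symmkey must be 16/24/32 bytes (shared at TOPUP time)
        return {"op": "STORE", "accountID": accountID, "iv": iv.hex(),
                "payload": AESGCM(symmkey).encrypt(iv, wrapped, None).hex()}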
Clients decide for themselves how they map their keys to certain servers (see the sketch after this list), using
- their configured pool
- clientid (passphrase hash as salt)
- and other data
While the client stores her private data in a hash table distributed across several nodes, the nodes themselves store any requested (key, value) pair (if the client pays for it...).
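One possible deterministic mapping (a rendezvous-hashing sketch in Python; entirely illustrative, not prescribed by these notes):

    import hashlib

    def nodes_for_key(key: str, clientsalt: str, pool: list, replicas: int = 3) -> list:
        """Rank the configured pool by sha1(clientsalt + node + key); take the top N.
        Different clients (different salts) spread the same key across different nodes."""
        rank = lambda node: hashlib.sha1((clientsalt + node + key).encode()).digest()
        return sorted(pool, key=rank)[:replicas]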
Statistical billing
Billing is a powerful tool to prevent DOS floods (consuming either CPU / IO or diskspace). At the same time, we want minimal billing overhead.
Proposal:
- for every query, the account is verified: total_topup > total_service_fee (we assume these checks are fast, usually serviced from memory). Note: no database update in this step.
- with a certain probability, total_service_fee += qfee (e.g. in 1% of the cases, using a random number generated on the server). This involves a database row update, but only with low probability. It will cause any DOS-ing account to deplete without killing total throughput. (See the sketch after this list.)
- the amount of stored data (and storage time) can be accounted for later, so it is also included in total_service_fee.
- example: Joe tops up his account with 101 epoints. 1 epoint is the fee of the topup operation.
- For the other 100 (that is ~0.1 EUR) he gets service fee tokens in exchange: total_topup += 100000 (ppmEUR)
- recommended fees: for 0.1 EUR, he should be able to execute 1000+ queries and store 1+ MByte for 6 years in the guaranteed storage.
- With a normal usage pattern, that is more than enough for a month of banking (for a fee that is less than 1/100 of one traditional international transaction, or of a typical monthly company bank account fee).
- In short, being 10..100 times cheaper than the costs incurred during transactions (transaction fees + costs hidden in exchange rates) with the monopolistic banking system is a reasonable first target. A few billion EURs staying in the hands of people who produce real wealth. These are just the transaction costs of course. The non-fraudulent money supply will make the real difference; that will be even more significant than the transaction costs, due to the lower total debt (do you mind 60000 billion USD of fraudulent debt?).
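A minimal sketch of the statistical accounting step (Python). The 1% probability is from the proposal; qfee is scaled so the expected charge matches a per-query fee of ~100 ppmEUR, but the concrete numbers are illustrative:

    import random

    QPROB = 0.01   # update the database row in ~1% of the queries
    QFEE = 10000   # ppmEUR: ~100 ppmEUR expected per query at 1% probability

    def admit(acct) -> bool:
        """Per-query check; no database write on the fast path."""
        if acct.total_topup <= acct.total_service_fee:
            return False                      # depleted: only TOPUP is allowed
        if random.random() < QPROB:
            acct.total_service_fee += QFEE    # rare row update; DOS-ers still deplete
        return True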
Invoice repository
We want an easy way of saying "please transfer X EPT to MD" or "Please use MD to transfer EPT to ... recipient". (The recipient specifying an MD is better than giving away a RAND: it is provable, and there is no chance of the RAND being hijacked by epoint pirates.)
Using a simple RAND => MD display would be extremely prone to user errors:
- user1 mistypes the MD
- or the recipient forgets the RAND for an MD he told someone earlier
So we need a convenient infrastructure for it. At the minimum, when user1 types the MD (destination), he should be able to verify that "yes, issuer/MD was reserved by ...pseudonym" to provide some safety. An issuer-specific storage would make some sense for this (RAND => MD is the same SHA1 function for all issuers, but the recipient expects payment at a certain issuer, so that's where we want to check). It would be cool if MD => "free text" could be retrieved. It could be definable at PUT() time whether the "free text" is still visible after some value was put to the MD; by default no, as the MD becomes public at that point. So any issuer / MD can have the following "states":
- MD => unknown
- MD => "free text" after PUT(MD, "free text", 0) - 0 specifies the "free text" is not retrievable after MD was topped up with some non-zero value
- MD => non-zero value (eg. after exchange or split operation).
- MD => exchanged after RAND was used (where the digest of RAND is MD)
- MD => unsafe: if RAND became public. With current (too gossipy) transaction log this is equivalent to above.
- naturally we will not be able to detect if RAND was published in other way, like a public forum ( - which is usually stupid).
How to store "invoice" (actually the secret RAND after we tell it's MD to someone who might transfer to us in the future ) in wallet ?
- any "invoice" in wallet could be stored as value=0. Than pocketbook would know it's a prepared destination account (we told the MD to someone else), not to be forgotten
- maybe a separate invoice tag would be more appropriate
Some lookup for these "prepared invoice" MDs is necessary. The issuer could provide a service of listing the MD(s) for a given "free text". So anyone can transfer to issuer/"Joe3546" if Joe previously prepared a RAND (that he stored in his wallet) and executed issuer->put( MD, "Joe3546", 0).
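An illustrative sketch of resolving the issuer/MD "states" listed above (Python; the record fields and their names are hypothetical):

    def md_state(md: str, db) -> str:
        """What the invoice display shows for issuer/MD."""
        rec = db.get(md)                     # hypothetical issuer-side lookup
        if rec is None:
            return "unknown"
        if rec.rand_published:
            return "unsafe"                  # the RAND became public
        if rec.exchanged:
            return "exchanged"               # the RAND was already used
        if rec.value > 0:
            return "value: %d" % rec.value   # free text hidden from here on
        return 'free text: "%s"' % rec.freetext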
Storage certificates - occasionally
Storage certificates have very little usefulness normally. It makes more sense to store at one more node than to get a certificate from the existing nodes. However, certificates make it possible to prove that a server lost data (a better alternative than self-experience or unprovable, therefore manipulatable, gossip).
For this reason a client can occasionally request a storage certificate (and store on another node). The server can export some streams to the public (also useful for replication) and voluntarily provide storage certificates on these (including expiry dates). These might be much more useful than certs of individual records.
BALANCE Storage: Encrypt to self
- We earlier assumed that the storage is some kind of hash (like Chord, Freenet, etc.).
- this might not always be the case. It may be a hash for bootstrap, using server-provided "inode filehandles" (autoincrement serial numbers) in the usual case.
- Balance revision number N can come from 3 places: trusted party (trusted storage), remember and type-in hint, binary search (see balance bootstrapping)
A full balance is a list of secret epoint RANDs (a view-only balance is a list of public epoint MDs): RANDs with positive value, plus designated RANDs with 0 value.
Storage of secret content:
- symkey(salt + content): a random salt value is attached to the FRONT of the document before symmetric encryption (see the sketch after this block).
- where symkey is directly derived from the passphrase. No need to store it anywhere. No asymmetric cryptography is required whatsoever if you only need to encrypt stuff to yourself.
Recommended symmetric encryption algorithm: .... ?
- We need to retrieve the actual content of our balance, identity_balancerevision_N.
- Where to store N? => Balance bootstrapping (finding out the balance's actual revision "N", after a hint from our gateway). See HighlyRedundantStorage (DHT notes and measures to somewhat protect from gateway-tamper attacks).
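A sketch of encrypt-to-self (Python). The notes leave the symmetric algorithm open, so AES-GCM (third-party cryptography package) and PBKDF2 key derivation are assumptions here; the random salt doubles as the cipher nonce:

    import hashlib, os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # assumed cipher

    def symkey(passphrase: str) -> bytes:
        # derived directly from the passphrase: nothing needs to be stored anywhere
        return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), b"balance", 100000)

    def encrypt_to_self(passphrase: str, content: bytes) -> bytes:
        salt = os.urandom(12)   # random salt at the FRONT of the stored document
        return salt + AESGCM(symkey(passphrase)).encrypt(salt, content, None)

    def decrypt_from_self(passphrase: str, blob: bytes) -> bytes:
        return AESGCM(symkey(passphrase)).decrypt(blob[:12], blob[12:], None)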
Binary search:
- to find the latest revision, we issue a binary search (max log2(N) steps; sketched below)
- A hint N from the user is ruled out. A hint from the gateway is accepted, though.
- if we received any hint, we start with at least 2 probes (N, N+1) in random order (yes!), and revert to binary search if necessary (if both exist, or neither does)
- k = hash(privatesalt, identity_balancerevision_N )
- where privatesalt is derived from the passphrase
The k=... hashing, and the verification of N, make an attack or a data-loss accident less likely. Especially if the storage is accessed via onion routing (similar to Tor, just routing messages instead of TCP flows). Using multiple storage systems further decreases the chance of data loss due to attack or technical problems.
With a highly redundant network storage, it is possible to store multiple backup replicas by varying the privatesalt. The probability that all replicas end up on the same host is (1 / numberofhosts) ** (numberofreplicas - 1), assuming that all hosts store equally big slices of the keyspace.
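A sketch of the revision discovery (Python). exists(N) stands for a storage lookup of hash(privatesalt, identity_balancerevision_N); the hint verification and the binary search follow the steps above:

    def latest_revision(exists, hint=None) -> int:
        """Return the highest N for which exists(N), assuming revisions 1..N all exist."""
        if hint and exists(hint) and not exists(hint + 1):
            return hint                    # gateway hint verified with 2 lookups
        lo, hi = 1, 2
        while exists(hi):                  # exponential probing for an upper bound
            lo, hi = hi, hi * 2
        while lo + 1 < hi:                 # classic binary search: <= log2(N) steps
            mid = (lo + hi) // 2
            if exists(mid):
                lo = mid
            else:
                hi = mid
        return lo if exists(lo) else 0     # 0: no revision stored yet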
Recommendation:
- retrieve balance revision N from a trusted party
- use hash(privatesalt, identity_balancerevision_N )
- verify with 2 lookups that N exists and N+1 does not
- use binary search if the verification fails
- use a minimum of 2 storage subsystems for revisions of the balance data (symmetrically encrypted).
- One contracted party, trusted to store reliably. If using the web-balance JAVA applet, it is intuitive that the contracted, secure storage is the server the applet is downloaded from (the server the applet connects to)
- +1 backup in a secondary hash. This MAY be a highly redundant network hash storage.
- some users will also store (each time after creating a new subaccount, or occasionally) on their local storage too. Inside the web-balance applet (or a separate CGI on the server?): "Dump balance data" + save file to disk
Storage in SKS pgp keyserver network:
It can store whatever you want, but there is one thing to keep in mind: you can only add to keys, never delete. "Deletions" are done by revocations, but those also make keys longer.
- So making a change to the key with every transaction is not practical.
- Adding reputation records, however, is very much so.
Node failure model for regression and stress tests
Because of the criticality of some data (primarily the actual data of wallets, secondarily historical data), the system must be tolerant to some insidious failures.
- 1 node lost totally (fire, lightning, HD, BSA or other disasters)
- files written in the last minute before a server freeze get lost
- files written in the last minute before a server freeze seem to be there, with the correct length, but with bad data (some or all blocks still hold data from earlier files)
- this is very insidious: a brief check does not notice it.
- this means that some verification is needed after server startup, and files need a checksum. The checksum is primarily calculated on the client. The server only calculates the checksum:
- before creating a storage certificate
- when verifying the pre-freeze files after startup
The server must take action to retrieve the data from other nodes when possible (it might not always be possible). Servers might use separate folders for different users so that errors during folder writes do not propagate (e.g. losing a folder with 10000 files in it could be bad!). Some filesystems (ext3? reiserfs? Oracle's own "filesystem"?) might be quite reliable, but some precautions cannot hurt.
The client must take action to verify data integrity and copy the data to more nodes if it is below the redundancy lower threshold. (A sketch of the startup verification follows.)
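A sketch of the post-startup verification (Python; the file handling and names are illustrative - the point is that presence and correct length are not enough, the checksum must be recomputed):

    import hashlib, os

    def verify_after_startup(suspect_files, expected_sha1, quarantine="quarantine"):
        """Re-hash files written shortly before the freeze; quarantine mismatches."""
        os.makedirs(quarantine, exist_ok=True)
        bad = []
        for path in suspect_files:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            if h.hexdigest() != expected_sha1[path]:
                os.rename(path, os.path.join(quarantine, os.path.basename(path)))
                bad.append(path)   # candidates to re-fetch from other nodes
        return bad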
Split Large Files to Chunks
- there is a maximum file size that can be uploaded to a storage service
- the limit must be a minimum of 64 kbyte. The recommended limit is 1..8 MByte
- a JAVA applet client must (and can easily) split big files into chunks and reassemble them (see the sketch after this list).
- Not sure if > 1 MByte files ever appear during "normal" wallet operation. Multi-level indexing (a la Minix) is used anyway.
- the limit is either
- advertised
- or specified by the server during the handshake of every STORE() operation
- or specified by STORE() via special parameters.
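A sketch of client-side chunking (Python; 4 MByte is one point inside the recommended 1..8 MByte window; the index file is the short "list of keys" mentioned earlier):

    import hashlib

    CHUNK = 4 * 2**20   # 4 MByte, within the recommended 1..8 MByte limit

    def split_for_store(data: bytes):
        """Store each chunk under its SHA1; the short index lists the chunk keys."""
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        keys = [hashlib.sha1(c).hexdigest() for c in chunks]
        return "\n".join(keys).encode(), dict(zip(keys, chunks))

    def assemble(index: bytes, fetch) -> bytes:
        """fetch(key) retrieves one stored chunk; concatenate in index order."""
        return b"".join(fetch(k) for k in index.decode().split())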
PHP + MySQL BLOB example
- Uses a small table for file metadata
- and a big table for the 64 kbyte chunks of file data.
- very practical (memory efficient) when streaming large files out to clients
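The two-table layout sketched as DDL (embedded in Python strings for consistency with the other sketches; column names are illustrative):

    FILEMETA_DDL = """
    CREATE TABLE filemeta (
      fileid INT AUTO_INCREMENT PRIMARY KEY,
      name   VARCHAR(255),
      length BIGINT,
      sha1   CHAR(40)
    )"""

    FILECHUNK_DDL = """
    CREATE TABLE filechunk (
      fileid INT NOT NULL,
      seq    INT NOT NULL,          -- chunk number within the file
      data   BLOB,                  -- one 64 kbyte chunk per row
      PRIMARY KEY (fileid, seq)     -- stream out with ORDER BY seq, row by row
    )"""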
Data model
Funds table
- primary key: storageid
- access allowed if >0. Otherwise only topup is allowed
- sometimes decremented, like once in 20 .. 50.. 100 transactions. See statistical billing.
The funds table might fit in memory and allow an extremely quick and inexpensive (no-disk-seek) lookup during an incoming request.
Inodes table
- primary key is an autoincrement numeric value.
- The server returns this handle during save, so it can be stored in the wallet.
- storageid and expiry time can also be stored, but need not be indexed
- used during maintenance and cleaning operations
The inodes table can grow very large, but reads are infrequent: mostly just logging, only looking back when auditing or if we suspect server naughtiness.
Namedfiles - used for bootstrapping, e.g. wallet
- storageid + code
- code can be numeric or alphanumeric
- revision
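The three tables above sketched as DDL (again as Python strings; the types and exact columns are illustrative):

    DATA_MODEL_DDL = """
    CREATE TABLE funds (
      storageid         INT PRIMARY KEY,  -- small table: fits in memory, no-seek lookup
      total_topup       BIGINT,           -- ppmEUR
      total_service_fee BIGINT            -- bumped statistically, see above
    );

    CREATE TABLE inodes (
      inode     BIGINT AUTO_INCREMENT PRIMARY KEY,  -- handle returned at save time
      storageid INT,      -- stored but not indexed: maintenance and cleaning only
      expiry    DATE,
      hash      CHAR(40),
      length    INT
    );

    CREATE TABLE namedfiles (
      storageid INT,
      code      VARCHAR(64),    -- numeric or alphanumeric
      revision  INT,
      inode     BIGINT,
      PRIMARY KEY (storageid, code, revision)
    )"""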
Bottom up notes