Websites, Hosting and Friends: Filesystem

A couple of months ago, I did a comparison of different distributed filesystems. It turned out that GlusterFS was the easiest and most feature-rich, but it was slow. Since I would really like to use it, I decided to give it another chance. Instead of doing raw benchmarks with sysbench, I decided to use siege to stress test a basic installation of the three PHP frameworks/CMSs I use the most.

My test environment:

MacBook Pro (Late 2013, Retina, i7 2.66 GHz)
PCIe-based Flash Storage
2-4 virtual machines using VMware Fusion 4, each with 2 GB of RAM
Ubuntu 13.10 server edition with PHP 5.5 and OPCache enabled
GlusterFS running on all VMs with a volume in replica mode
The volume was mounted with nodiratime,noatime using the GlusterFS native driver (NFS was slower)

The test:

siege -c 20 -r 5 http://localhost/foo # Cache warming
siege -c 20 -r 100 http://localhost/foo # Actual test

I then compared the local filesystem (inside the VM) vs the Gluster volume using these setups:

2 nodes, 4 cores per node
2 nodes, 2 cores per node
4 nodes, 2 cores per node

The value compared is the total time to serve 20 x 100 requests (20 concurrent users, 100 repetitions each).
All tests were run 2-3 times while my computer was doing nothing else, and the results were very consistent.

Setup	Filesystem	Symfony	Wordpress	Drupal	Average
2 nodes 4 cores	Local	2.91 s	9.92 s	5.39 s	6.07 s
2 nodes 4 cores	Gluster	10.84 s	23.94 s	7.81 s	14.20 s
2 nodes 2 cores	Local	5.41 s	19.14 s	9.67 s	11.41 s
2 nodes 2 cores	Gluster	25.05 s	31.91 s	15.17 s	24.04 s
4 nodes 2 cores	Local	5.57 s	19.60 s	9.79 s	11.65 s
4 nodes 2 cores	Gluster	30.56 s	35.92 s	18.36 s	28.28 s

Comparison	Scope	Symfony	Wordpress	Drupal	Average
Local vs Gluster	2 nodes, 4 cores	273 %	141 %	45 %	153 %
Local vs Gluster	2 nodes, 2 cores	363 %	67 %	57 %	162 %
Local vs Gluster	4 nodes, 2 cores	449 %	83 %	88 %	206 %
Local vs Gluster	Average	361 %	97 %	63 %	174 %
2 nodes vs 4 nodes	Local	3 %	2 %	1 %	2 %
2 nodes vs 4 nodes	Gluster	22 %	13 %	21 %	19 %
4 cores vs 2 cores	Local	86 %	93 %	79 %	86 %
4 cores vs 2 cores	Gluster	131 %	33 %	94 %	86 %
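
The percentage rows appear to be the extra time taken relative to the faster setup, that is (slow - fast) / fast. Here is a minimal Python sketch under that assumption; the function name slowdown_percent is mine, and the timings are the "2 nodes 4 cores" values from the table.

```python
def slowdown_percent(baseline_s: float, measured_s: float) -> float:
    """Extra time taken relative to the baseline, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100


# Timings (seconds) from the "2 nodes 4 cores" rows of the table.
local = {"Symfony": 2.91, "Wordpress": 9.92, "Drupal": 5.39}
gluster = {"Symfony": 10.84, "Wordpress": 23.94, "Drupal": 7.81}

for name in local:
    print(f"{name}: {slowdown_percent(local[name], gluster[name]):.0f} %")
```

With these inputs the rounded results match the "Local vs Gluster" values for 2 nodes and 4 cores, which suggests this is indeed how the percentages were derived.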

Google spreadsheet

Observations:

Red (Local vs Gluster): Wordpress and Drupal show an acceptable loss in performance under Gluster, but Symfony is catastrophic.
Blue (2 nodes vs 4 nodes, Local): The local tests are slightly slower with 4 nodes than with 2. This is expected, since my computer had 4 VMs running instead of 2.
Green (2 nodes vs 4 nodes, Gluster): The Gluster tests are about 20% slower on a 4-node setup because there is more communication between the nodes to keep them all in sync. A 20% overhead for double the nodes isn’t that bad.
Purple (4 cores vs 2 cores, Local): The local tests take about 86% longer with 2 cores than with 4. A bit under 100% is normal, since there is always some overhead to parallel processing.
Yellow (4 cores vs 2 cores, Gluster): For the Gluster tests, Symfony and Drupal scale very well with the number of cores, but Wordpress is stalling. I am not sure why.

I am still not sure why Symfony is so much slower on GlusterFS, but really, I can’t use it in production for the moment because I/O is already the weak point of my infrastructure. I am in the process of looking for a different hosting solution, maybe it will be better then.

This post is part of: Guide to replicated LAMP stack hosting with failover

The core concept of choosing a filesystem for a Web hosting cluster is to eliminate single points of failure, but it is not always that easy. A true distributed system still needs to perform well, at least on reads. The problem lies in the fact that the bottleneck is very often I/O, so if your filesystem is slow, you will end up spending a fortune on scaling without gaining real performance.

Making priorities

You can’t have everything, so start by making a list of priorities. Different systems will have different needs, but I figured I could tolerate the possibility of a failure as long as the system could be restored, since I would be keeping periodic backups.

Low maintenance

It must be possible to read/write from any folder without adding a manifest for each site.
The system must be completely autonomous and require no maintenance from a sysadmin (conflict management included).

Simple / Cheap

Must be installable on each Web node, or on a maximum of 2 small/medium extra nodes.
Must run on Ubuntu, without recompiling the kernel. Kernel modules are acceptable.

Performant

Reads less than 50% slower than standard ext3 reads.
Writes less than 80% slower than standard ext3 writes.
Must be good at handling a lot of small files. Currently, my server hosts 470k files for a total of 6.8 GB. That is an average of 15 KB per file!
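
To get the same kind of figures for another server, the file count and average size can be measured with a short script. This is only a sketch: the /var/www path is a hypothetical web root, and 1 KB is taken as 1000 bytes.

```python
import os


def tree_stats(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            count += 1
            total += os.path.getsize(path)
    return count, total


def average_kb(count: int, total_bytes: int) -> float:
    """Average file size in KB (1 KB = 1000 bytes); 0.0 for an empty tree."""
    return total_bytes / count / 1000 if count else 0.0


if __name__ == "__main__":
    count, total = tree_stats("/var/www")  # hypothetical web root
    print(f"{count} files, {total / 1e9:.1f} GB, "
          f"{average_kb(count, total):.1f} KB on average")
```

With the numbers quoted above, average_kb(470_000, 6_800_000_000) gives roughly 14.5 KB, in line with the rounded 15 KB.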

Consistency
Changes must propagate to all servers within 5 seconds.

Uploaded files stored in database but not yet synced may generate some errors for a short period if viewed by other users on other servers.
Temporary files are only relevant on the local machine so a delay is not a big deal.
HTTP sessions will be sticky at the load balancer level, so user-specific information will be handled properly.

Must handle ACLs

For permissions to be set perfectly, we will be using ACLs.
ACLs may not be readable within the Web node, but they must still be enforced.

Durability

Must handle filesystem failures — be repairable very quickly.
File losses are acceptable in the event of a filesystem failure.
Filesystem must continue to function even if a Web node goes offline.
No single point of failure. If there is one, it must be isolated on its own machine.

A. Synchronisation

Synchronisation means that there is no shared filesystem at all: all the files are stored on each node’s local filesystem and are synchronised with the other nodes periodically or by watching I/O events.

Cluster synchronisation involving replication between all the nodes is usually very hard. To improve performance and reduce the risk of conflicts, it is often a good idea to elect a replication leader and a backup. If the leader is unavailable, the backup will be used instead. This way, all the nodes will sync with only one.
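
The leader/backup fallback described above could look like the following Python sketch. The host names and the availability check are hypothetical; a real setup would replace the lambda with an actual health check (a ping or an SSH probe, for example).

```python
from typing import Callable, Optional, Sequence


def pick_sync_source(
    candidates: Sequence[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable node in priority order (leader, then backup)."""
    for node in candidates:
        if is_available(node):
            return node
    return None  # nobody to sync with; the caller should retry later


# Hypothetical host names; the health check is stubbed with a set of live nodes.
up = {"web2", "web3"}
source = pick_sync_source(["web1", "web2"], lambda node: node in up)
print(source)  # web2: the leader web1 is down, so the backup is used
```

Every node runs the same selection, so they all converge on a single sync source instead of replicating with each other.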

Pros

Very fast read/write
Very simple to set up

Cons

May have trouble synchronizing ACLs
May generate a lot of I/O
Will most likely generate conflicts

Rsync

The typical tool for fast file syncing is rsync. It is highly reliable and a bit of Bash scripting will get you started. However, as the number of files grows, it may become slow. For around a million files, a single pass may easily take over 5 seconds. Given our 5-second propagation requirement, it would have to run continuously, which would generate a lot of I/O and thus hurt overall performance.

Csync2

Csync2 is a promising tool that works like rsync, but it keeps file hints in a SQLite database. When a file changes, it flags in the database that the file needs checking. This way, the full sync only needs to check marked files.
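
The dirty-flag idea can be illustrated with a few lines of Python and SQLite. This is not Csync2's actual schema or code, only a sketch of the principle; the table and function names are made up for the example.

```python
import sqlite3


def open_hints(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hint table: one row per file that needs checking."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS dirty (filename TEXT PRIMARY KEY)")
    return db


def mark_dirty(db: sqlite3.Connection, filename: str) -> None:
    """Flag a changed file; flagging it twice is harmless."""
    db.execute("INSERT OR IGNORE INTO dirty (filename) VALUES (?)", (filename,))
    db.commit()


def take_dirty(db: sqlite3.Connection) -> list[str]:
    """Return the flagged files and clear the flags for the next pass."""
    rows = db.execute("SELECT filename FROM dirty ORDER BY filename").fetchall()
    db.execute("DELETE FROM dirty")
    db.commit()
    return [row[0] for row in rows]


db = open_hints()
mark_dirty(db, "/var/www/site/index.php")
mark_dirty(db, "/var/www/site/index.php")  # duplicate change, still one row
mark_dirty(db, "/var/www/site/style.css")
print(take_dirty(db))  # the two changed files
print(take_dirty(db))  # empty list: nothing left to check
```

The point is that a sync pass costs time proportional to the number of changed files, not to the size of the whole tree, which is where a plain rsync loop struggles.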

Csync2 supports multi-master replication and slaves (receive-only). However, I found while testing that it is not really suited to a lot of small files changing frequently: it tends to generate a lot of conflicts that need to be resolved manually.

It may not be the best solution for Web hosting, but for managing deployment of libraries or similar tasks, it would be awesome.

B. Simple sharing (NFS)

Even simpler than file syncing is plain old sharing. One node is responsible for hosting the files and serves them directly. Windows uses SMB/CIFS, Mac uses AFP and Linux uses NFS.

NFS is very old, like 1989 old. Even the latest version, NFSv4, came around in 2000. This means it is very stable and very good at what it does.

Pros

Supports ACLs (NFSv4)
Very cheap and simple setup
Up to a certain scale, fast read/write

Cons

Single point of failure
Hard to set up proper failover
Not scalable

C. Distributed / Replicated

A distributed filesystem may operate at a device, block or inode level. You can think of this a bit like a database cluster. It usually involves journals and is the most advanced solution.

Pros

Very robust
Scalable

Cons

Writes are often painfully slow
Reads can also be slow
Often complex to set up

GlusterFS

Gluster volumes can be mounted through FUSE or NFS. Each node can have its own brick, and the daemon handles the replication transparently, without the need for a management node.

Overall, it is very good software, the write performance is decent and it handles failures quite well. There has been a lot of recent work to improve caching, async writes, write-ahead, etc. However, in my experience, the read performance is disastrous. I really tried tuning it a lot, but I still feel like I haven’t found the true potential of this.

Ultimately, I had to drop it for the moment because I lacked the time to tune it further. It has a large community and is widely used, so I will probably end up giving it another chance.

Lustre

Lustre seems like the Holy Grail of distributed filesystems. From Wikipedia: “At the present time, six of the top 10 and more than 60 of the top 100 supercomputers in the world have Lustre file systems in them.”

It appears to have everything I could dream of: speed, scalability, locks, ACLs, you name it.

However, I was never able to try it. It requires dedicated machines with various roles: management, data, file servers (API). This means I would need 4-5 additional machines. On top of that, it needs custom kernel modules.

Definitely on my wish-list, but inaccessible for the moment.

DRBD

DRBD is not a cluster solution; it does live backup. Usually, it is used to make a full mirror of a server that can be swapped with the master at any moment, should it fail. This is often used to patch solutions where replication is not built in. Examples of this are NFS or MySQL. There is a way to set up a 3-node solution, but it is far from perfect.

Conclusion

In the end, I found that synced solutions were not reliable enough and distributed solutions were too complex, so I chose NFS. My plan is to add DRBD soon to provide a durability layer, but a more serious solution will have to wait. If my cluster scales to the point where NFS is no longer up to the task, it will mean I have enough clients, enough money and enough reasons to consider a proper solution.

Comparison
Solution	Maintenance	Complexity	Performance	Scalability	Durability	Consistency	ACLs
Rsync	Low	Low	Very high	Low	High	Low	Yes
Csync2	High	Medium	Very high	Low	High	Low	Yes
NFS	None	None	Medium	None	None	Very high	Enforced
GlusterFS	None	Medium	Low	High	High	Very high	Yes
Lustre	None	Very high	High	Very high	Very high	Very high	Yes
DRBD	None	Medium	n/a	2 or 3	Very high	n/a	Yes

Friday, December 6, 2013

GlusterFS performance on different frameworks

Saturday, April 13, 2013

LAMP Cluster — Distributed filesystem