Some benchmarks comparing LevelDB on FreeBSD and Linux

I took two machines with essentially identical hardware (two blades in the same chassis), installed FreeBSD on one and Linux on the other, and measured how LevelDB's performance differs between the two environments.

Build environment:

FreeBSD 9.0: gcc 4.6, libunwind-1.0.1, google-perftools-1.8.3

(libunwind-0.99-beta failed to compile; google-perftools-1.9.1 compiled, but after linking against db_bench the binary core dumped at runtime)

CentOS Linux: gcc 4.4, google-perftools-1.9.1, libunwind-0.99-beta

The LevelDB code is the latest trunk checkout from git, last modified 2011-11-30, rev c8c5866a86c8.

Operating system details:

One machine runs FreeBSD 9.0-RELEASE:

# uname -a

FreeBSD test27.localdomain 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Tue Jan 3 07:46:30 UTC 2012 root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64

The file system is ZFS, with da0 and da1 in a mirror. The RAID controller built into the SAS card was not used, because there is no driver for it. All sysctl parameters are at system defaults; no tuning was done.

The other runs CentOS release 5.7 (Final):

# uname -a
Linux test26 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

The file system is ext3.

The benchmark program is db_bench, which ships with LevelDB. It actually has a flaw: run repeatedly on the same machine, it reports progressively better results.

Run parameters: ./db_bench --db=/home/t3 --write_buffer_size=107374182 --cache_size=107374182 --threads=4

Test notes:

Unless otherwise noted below, keys are 16 bytes, values are 100 bytes, and the entry count is N = 1,000,000.

Name           Description
fillseq        sequential writes (in key order), N entries, async mode
fillsync       random writes (not in key order), N/100 entries, sync mode
fillrandom     random writes (not in key order), N entries, async mode
overwrite      overwrite N entries, in random order, async mode
readrandom     N random reads
readrandom     (a second pass, immediately after) not separately meaningful
readseq        N sequential reads
readreverse    N reads in reverse key order
compact        compact the entire database; deleted and overwritten versions are discarded, and the data is rearranged to reduce the cost of operations
readrandom     N random reads
readseq        N sequential reads
readreverse    N reads in reverse key order
fill100K       random writes (not in key order), N/1000 entries, 100K each, async mode
crc32c         repeatedly compute CRC32 over a 4K block of data
snappycomp     compress with snappy
snappyuncomp   decompress with snappy
acquireload    N*1000 loads

Linux:

# ./db_bench --db=/home/t3 --write_buffer_size=107374182 --cache_size=107374182 --threads=4

LevelDB: version 1.2
Date: Tue Jan 17 12:48:28 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache: 12288 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 38.842 micros/op; 11.3 MB/s
fillsync : 41449.316 micros/op; 0.0 MB/s (1000 ops)
fillrandom : 42.102 micros/op; 10.3 MB/s
overwrite : 43.050 micros/op; 10.1 MB/s
readrandom : 12.359 micros/op;
readrandom : 8.789 micros/op;
readseq : 0.506 micros/op; 874.5 MB/s
readreverse : 0.806 micros/op; 497.5 MB/s
compact : 18211470.000 micros/op;
readrandom : 6.197 micros/op;
readseq : 0.430 micros/op; 964.3 MB/s
readreverse : 0.737 micros/op; 567.2 MB/s
fill100K : 3470.828 micros/op; 108.9 MB/s (1000 ops)
crc32c : 5.251 micros/op; 2972.6 MB/s (4K per op)
snappycomp : 11.369 micros/op; 1372.1 MB/s (output: 55.1%)
snappyuncomp : 1.806 micros/op; 8630.8 MB/s
acquireload : 0.869 micros/op; (each op is 1000 loads)

FreeBSD:

# ./db_bench --db=/home/t8 --write_buffer_size=107374182 --cache_size=107374182 --threads=4

LevelDB: version 1.2
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 37.417 micros/op; 11.7 MB/s
fillsync : 81650.791 micros/op; 0.0 MB/s (1000 ops)
fillrandom : 45.759 micros/op; 9.6 MB/s
overwrite : 43.303 micros/op; 9.9 MB/s
readrandom : 13.796 micros/op;
readrandom : 9.536 micros/op;
readseq : 0.567 micros/op; 644.3 MB/s
readreverse : 1.042 micros/op; 363.7 MB/s
compact : 13727166.000 micros/op;
readrandom : 6.372 micros/op;
readseq : 0.509 micros/op; 636.9 MB/s
readreverse : 0.771 micros/op; 533.2 MB/s
fill100K : 3733.620 micros/op; 101.1 MB/s (1000 ops)
crc32c : 5.252 micros/op; 2963.7 MB/s (4K per op)
snappycomp : 10.450 micros/op; 1468.3 MB/s (output: 55.1%)
snappyuncomp : 1.658 micros/op; 9387.1 MB/s
acquireload : 0.853 micros/op; (each op is 1000 loads)

The most conspicuous differences are in fillsync and readseq, so I set out to isolate and test just those two. In doing so I noticed an interesting phenomenon: increasing write_buffer_size significantly reduces readseq throughput.

These are the results on Linux:

[root@test26 leveldb]# ./db_bench --db=/home/t27 --write_buffer_size=52428800 --cache_size=52428800 --threads=4 --benchmarks=fillseq,readseq,compact,readseq
LevelDB: version 1.2
Date: Tue Jan 17 13:36:36 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache: 12288 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 37.209 micros/op; 11.7 MB/s
readseq : 0.611 micros/op; 722.4 MB/s
compact : 6895956.000 micros/op;
readseq : 0.461 micros/op; 910.5 MB/s
[root@test26 leveldb]# ./db_bench --db=/home/t28 --write_buffer_size=419430400 --cache_size=52428800 --threads=4 --benchmarks=fillseq,readseq,compact,readseq
LevelDB: version 1.2
Date: Tue Jan 17 13:38:32 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache: 12288 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 37.258 micros/op; 11.8 MB/s
readseq : 1.653 micros/op; 211.3 MB/s
compact : 27502101.000 micros/op;
readseq : 0.461 micros/op; 960.6 MB/s

These are the results on FreeBSD:

[root@test27 ~/packages/leveldb]# ./db_bench --db=/home/t27 --write_buffer_size=52428800 --cache_size=52428800 --threads=4 --benchmarks=fillseq,readseq,compact,readseq
LevelDB: version 1.2
Date: Tue Jan 17 05:41:40 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache:
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 34.314 micros/op; 12.3 MB/s
readseq : 0.829 micros/op; 505.8 MB/s
compact : 2340297.000 micros/op;
readseq : 0.511 micros/op; 721.6 MB/s
[root@test27 ~/packages/leveldb]# ./db_bench --db=/home/t28 --write_buffer_size=419430400 --cache_size=52428800 --threads=4 --benchmarks=fillseq,readseq,compact,readseq
LevelDB: version 1.2
Date: Tue Jan 17 05:42:29 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache:
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)

fillseq : 32.211 micros/op; 13.1 MB/s
readseq : 1.948 micros/op; 188.8 MB/s
compact : 30027946.000 micros/op;
readseq : 0.491 micros/op; 901.8 MB/s

I genuinely don't understand this. The tests were run with plenty of free system memory, which makes it all the more puzzling.

Then I switched to single-threaded mode, increased num, and measured fillsync on its own. This is Linux:

# ./db_bench --db=/home/t30 --write_buffer_size=52428800 --cache_size=52428800 --benchmarks=fillsync --num=10000000
LevelDB: version 1.2
Date: Tue Jan 17 13:53:51 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache: 12288 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 10000000
RawSize: 1106.3 MB (estimated)
FileSize: 629.4 MB (estimated)

fillsync : 11453.367 micros/op; 0.0 MB/s (10000 ops)

This is FreeBSD:

# ./db_bench --db=/home/t30 --write_buffer_size=52428800 --cache_size=52428800 --benchmarks=fillsync --num=10000000
LevelDB: version 1.2
Date: Tue Jan 17 05:54:50 2012
CPU: 16 * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPUCache:
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 10000000
RawSize: 1106.3 MB (estimated)
FileSize: 629.4 MB (estimated)

fillsync : 25343.126 micros/op; 0.0 MB/s (10000 ops)

Still, how much this matters depends on your point of view. With plain fillrandom there is no obvious difference between FreeBSD and Linux; add sync, though, and the picture changes. LevelDB's documentation says:

// If true, the write will be flushed from the operating system
// buffer cache (by calling WritableFile::Sync()) before the write
// is considered complete. If this flag is true, writes will be
// slower.
//
// If this flag is false, and the machine crashes, some recent
// writes may be lost. Note that if it is just the process that
// crashes (i.e., the machine does not reboot), no writes will be
// lost even if sync==false.
//
// In other words, a DB write with sync==false has similar
// crash semantics as the "write()" system call. A DB write
// with sync==true has similar crash semantics to a "write()"
// system call followed by "fsync()".
//
// Default: false
bool sync;
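
For reference, this is how the flag is set from client code. A minimal sketch against the public LevelDB API; the database path and keys here are made up for illustration:

#include <cassert>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  // Default: sync == false. Put() returns once the data has been
  // handed to the OS buffer cache.
  leveldb::WriteOptions async_write;
  s = db->Put(async_write, "key1", "value1");

  // sync == true: WritableFile::Sync() runs before Put() returns,
  // so the write survives a machine crash. This is the slow path
  // that fillsync exercises.
  leveldb::WriteOptions sync_write;
  sync_write.sync = true;
  s = db->Put(sync_write, "key2", "value2");

  delete db;
  return 0;
}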

First, I must stress that your own program is far more likely to crash than the OS is (though panpan doesn't quite agree). So in my view, once the data has been handed to the OS, it is safe. What about sync=true, then? The documentation says it is roughly write followed by fsync. In fact, the underlying implementation is munmap plus msync(,,MS_SYNC).
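
To make the two paths concrete, here is a minimal sketch using plain POSIX calls (error handling omitted; this is not LevelDB's actual code, just the shape of what is described above): write+fsync on one side, mmap plus msync(MS_SYNC) on the other.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Path 1: what the documentation's analogy describes.
void write_then_fsync(const char* path, const char* buf, size_t len) {
  int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
  write(fd, buf, len);  // data lands in the OS buffer cache
  fsync(fd);            // ask the kernel to push it to the device
  close(fd);
}

// Path 2: roughly what an mmap-based WritableFile does on Sync().
void mmap_then_msync(const char* path, const char* buf, size_t len) {
  int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
  ftruncate(fd, len);   // the mapping needs file space behind it
  char* p = (char*)mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
  memcpy(p, buf, len);     // "write" by storing into the mapping
  msync(p, len, MS_SYNC);  // synchronously write back dirty pages
  munmap(p, len);
  close(fd);
}

This pairing is why the documentation can describe a sync==true write as "write() followed by fsync()".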

On FreeBSD, the end of the msync man page says:

BUGS
The msync() system call is obsolete since BSD implements a coherent file
system buffer cache. However, it may be used to associate dirty VM pages
with file system buffers and thus cause them to be flushed to physical
media sooner rather than later.

It is the same on Linux: msync(,,MS_SYNC) amounts to fsync, and from there it is up to the VM; the data may not actually have been flushed to disk. Especially since, in general, we run with the disk's write cache enabled. Given that sync is not a no-op and costs this much time, in specialized applications it is presumably better to enable it than not? In any case, I have no use for it right now, so never mind. The conclusion: although the two systems do differ in performance here, neither can say clearly what it actually did, what it did not do, or what it can guarantee, so the comparison has no real value.
