Lab Notes: Something weird is happening with sysbench

Hello

Hello. It is 2 PM. This morning, I took Eamon to see my cousin Sasha. We got some coffee, and Eamon was very cute, asking the bartenders their names and thanking them.

What am I doing?

Yesterday, I ran initial sysbench MySQL benchmarks and found that my Intel Mac can handle about 10,000 queries per second without ruining the query duration (keeping it around ~60ms). I reached 10k qps by twiddling with how many threads sysbench used: 16 sysbench threads took me to 10k qps.

I didn’t write this down, but yesterday I noticed that the disk was 100% utilized even at 1,600 qps. It was also 100% utilized at 10,000 qps. What I am confused about is how I was able to squeeze more performance out of MySQL despite the disk already being fully utilized at 1,600 qps.

An experiment involving dead uncles and elevators

I do have a theory! Despite the disk being 100% utilized, it doesn’t mean it was at 100% capacity. How can that be? Let me illustrate using an elevator.

Say that your rich uncle died, and you have inherited his elevator. I don’t know! And, I guess, now you are going to run experiments on this elevator. Say that the elevator has a capacity of 10 people. Your friends are extremely enthusiastic people, and you convince them to try an experiment. You are all on the bottom floor, and you send 1 friend up. It takes 30 seconds. The friend gets off on the top floor. The elevator comes back down, which takes 30 seconds, and immediately you send another 1 friend up. You repeat. The elevator is at 100% utilization: it is always busy. The throughput is 1 friend per minute. Now, instead of sending 1 friend at a time, you send 10. Now, your utilization is still 100%, but your throughput has increased to 10 friends per minute.

I don’t know if this is handled in the OS or in the disk itself, but I am sure there is something similar happening for writes.

How can I verify this theory?

I can use iostat to monitor the disk. (What is iostat? It’s a monitoring utility for IO, like disks). What I will look for is the utilization %, the throughput (MB per second) [this will be indicated by the wkB/s field], and the request size (this will correspond to how many friends we are putting in the elevator at the same time) [this will be the wareq-sz field].

I expect the utilization to remain at 100%, but the throughput will increase, which will happen thanks to us stuffing more friends in the elevator (wareq-sz will go up).

The specific command I will use is iostat -xh 1 sda. -x is a flag which prints a ton more info than default, and -h “humanizes” it, e.g. converting 1,000 KB into 1 MB. 1 tells it to refresh the output every second, and sda limits it to the 1 disk I’m interested in.

Alright, so, I have that running in one window. In another, I’ll run the 1-thread benchmark: docker compose run sysbench --threads=1 oltp_read_write run.

Alright, check it out. The %util is 100%, wkB/s is 45 MB, and the wareq-sz is 22.5 KB. We’ll compare these to the next run.

(Here’s the full output)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.9%    0.0%    3.5%   10.5%    0.0%   83.1%

     r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
 2062.00     45.4M     0.00   0.0%    0.48    22.5k sda

     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     f/s f_await  aqu-sz  %util Device
  576.00    1.30    1.73 100.0% sda

Now, let’s run 16 threads: docker compose run sysbench --threads=1 oltp_read_write run.

Aha! The %util is still 100%, but we are now writing 59 MB per second, and the wareq-sz is 25.4 KB.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.5%    0.0%   10.7%    8.9%    0.0%   62.0%

     r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
 2380.00     59.0M     0.00   0.0%    0.45    25.4k sda

     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     f/s f_await  aqu-sz  %util Device
  631.00    1.23    1.85 100.0% sda

This means that, as expected, we are able to increase the disk throughput even though the utilization was already 100%.

What is the maximum write speed of my disk?

Even if I increase the number of sysbench threads (so that I can send more queries per second to MySQL), the disk write speed doesn’t go above 60 MB per second. This is probably the disk’s maximum write speed. However, I’d like to confirm this. The reason is that oltp_read_write is not a pure test of disk write speed. For example, as I increase the number of sysbench threads, MySQL has to spend more resources juggling the connections, which could prevent it from fully utilizing the disk.

I need to exercise the disk outside the context of MySQL. I can do this using sysbench’s fileio test. Note that I am not truly going to test the disk’s write performance. Instead, I am testing the OS write performance. They are different, but I can’t explain yet what the difference is. Maybe later I can explore that.

Figuring out which benchmark to run

Anyway, the fileio benchmark has a number of modes: seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw. I can tell that some of these correspond to sequential vs random reads/writes, but otherwise I can only guess what the modes mean. First, I need to find documentation or read the source. The help for fileio is not helpful.

$ docker compose run sysbench fileio help

sysbench 1.0.17 (using bundled LuaJIT 2.1.0-beta2)

fileio options:
  --file-num=N                  number of files to create [128]
  --file-block-size=N           block size to use in all IO operations [16384]
  --file-total-size=SIZE        total size of files to create [2G]
  --file-test-mode=STRING       test mode {seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw}
  --file-io-mode=STRING         file operations mode {sync,async,mmap} [sync]
  --file-async-backlog=N        number of asynchronous operatons to queue per thread [128]
  --file-extra-flags=[LIST,...] list of additional flags to use to open files {sync,dsync,direct} []
  --file-fsync-freq=N           do fsync() after this number of requests (0 - don't use fsync()) [100]
  --file-fsync-all[=on|off]     do fsync() after each write operation [off]
  --file-fsync-end[=on|off]     do fsync() at the end of test [on]
  --file-fsync-mode=STRING      which method to use for synchronization {fsync, fdatasync} [fsync]
  --file-merged-requests=N      merge at most this number of IO requests if possible (0 - don't merge) [0]
  --file-rw-ratio=N             reads/writes ratio for combined test [1.5]

The main --help output isn’t helpful either. Nice! Alright, what about the source? grepping the repo for seqrewr showed me a list containing descriptive test names:

    if (!strcmp(mode, "seqwr"))
      test_mode = MODE_WRITE;
    else if (!strcmp(mode, "seqrewr"))
      test_mode = MODE_REWRITE;
    else if (!strcmp(mode, "seqrd"))
      test_mode = MODE_READ;
    else if (!strcmp(mode, "rndrd"))
      test_mode = MODE_RND_READ;
    else if (!strcmp(mode, "rndwr"))
      test_mode = MODE_RND_WRITE;
    else if (!strcmp(mode, "rndrw"))
      test_mode = MODE_RND_RW;

What I want is sequential writes (seqwr), then. Why?

I want writes only, because my oltp_read_write benchmarks only wrote to disk, didn’t read (I didn’t really expand on this previously, but I think this is because the entire database fits in memory, so the reads don’t hit the disk. I can explore this later!)
I want to find the maximum write speed of the disk, and sequential writes are going to be faster than random writes because less time will be spent seeking.

Running the sequential write fileio benchmark

In sysbench, for a benchmark you do 3 things: prepare it, actually run it, and then clean it up. I’ve been omitting this from my outputs. For some tests, you don’t actually need to clean up between runs, because the test cleans up after itself. For example, the oltp_read_write test deletes every row it creates. It would probably be best to actually clean up and prepare between every test, though… yeah, maybe I’ll start doing that. I can just use a small bash script to do so.

#!/usr/bin/env bash

docker compose run sysbench $* prepare
docker compose run sysbench $* run
docker compose run sysbench $* cleanup

$* expands to all positional arguments. I can use this script like so: ./run_test.sh --threads=16 oltp_read_write.

Anyway, let’s actually run the fileio test: ./run_test.sh fileio --file-test-mode=seqwr.

Oh, wow. The actual write throughput is WAY higher! 187 MB per second.

sysbench 1.0.17 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      0.00
    writes/s:                     11394.67
    fsyncs/s:                     14596.37

Throughput:
    read, MiB/s:                  0.00
    written, MiB/s:               178.04

General statistics:
    total time:                          10.0032s
    total number of events:              259904

Latency (ms):
         min:                                    0.00
         avg:                                    0.04
         max:                                   31.00
         95th percentile:                        0.01
         sum:                                 9928.30

Threads fairness:
    events (avg/stddev):           259904.0000/0.00
    execution time (avg/stddev):   9.9283/0.00

Alright, that makes sense, because these are sequential writes. What about random writes?

Ah, that’s strange. The prepare step clearly creates the files necessary for the test:

128 files, 16384Kb each, 2048Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
Creating file test_file.1
Creating file test_file.2
...

And yet the run step complains that it can’t find them.

FATAL: Cannot open file 'test_file.0' errno = 2 (No such file or directory)

Strange. Indeed, I cannot see the test_files anywhere. Huh. Well, where are these files created during the prepare step?

Maybe I can use strace to figure out where the files are created. To help keep output readable, I’ll limit it to just 1 file: docker compose run sysbench fileio prepare --file-test-mode=rndwr --file-num=1.

Ah, wait, I don’t need to run anything. I completely forgot that I’m doing all of this in Docker. The files created during the prepare step are created in their own container, and are not persisted in any way that would expose them to the next invocation of the sysbench service.

OK, so what I can do is rewrite my run_test.sh:

#!/usr/bin/env bash

sysbench $* prepare
sysbench $* run
sysbench $* cleanup

I can “upload” the file into the container, without modifying the Dockerfile, like so:

  sysbench:
    image: severalnines/sysbench:latest
    volumes:
      - /home/dmitry/dev/labs/sysbench-mysql-basic/scripts:/scripts

And then I can execute the prepare/run steps inside a single container: docker compose run sysbench scripts/run_test.sh fileio --file-test-mode=seqwr.

So… does it work? Yes!

File operations:
    reads/s:                      0.00
    writes/s:                     396.02
    fsyncs/s:                     514.83

Throughput:
    read, MiB/s:                  0.00
    written, MiB/s:               6.19

General statistics:
    total time:                          10.0986s
    total number of events:              9072

Latency (ms):
         min:                                    0.00
         avg:                                    1.10
         max:                                    8.89
         95th percentile:                        3.49
         sum:                                 9993.76

Threads fairness:
    events (avg/stddev):           9072.0000/0.00
    execution time (avg/stddev):   9.9938/0.00

Huh. It’s only doing ~6 MB/s. Very slow. Although, strangely the disk write throughput from iostat is extremely different: ~50 MB_s.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.1%    0.0%    1.8%   11.0%    0.0%   86.1%

     r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
 2256.00     53.2M     0.00   0.0%    0.46    24.2k sda

     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k sda

     f/s f_await  aqu-sz  %util Device
  643.00    1.21    1.83 100.0% sda

Let me see if there’s a similar discrepancy between iostat’s throughput and sysbench’s reported throughput for the seqwr benchmark.

Running docker compose run sysbench scripts/run_test.sh fileio --file-test-mode=seqwr --threads=1.

Sysbench’s throughput is 178 MB/s like before. Iostat’s: 200 MB’s. I mean, this is pretty close. This is not nearly as much of a discrepancy as iostat’s. Let me double-check iostat’s output against another monitoring utility (atop). No, atop agrees with iostat!

Something mysterious is going on. I’ll dive into this on Monday. Have a good weekend!

Meanwhile, we can safely say that 60 MB/s is not the disk’s maximum write speed. It is more like 200 MB/s.

Dmitry Mazin

"cyberdemon.org is a cool domain"