MangoPI MQ Dual boot speed

Here we discuss getting the MangoPI MQ Dual to boot to a prompt in 1.5 seconds.

The MangoPI MQ Dual is a tiny board based around the AllWinner T113-S3.

This is an interesting processor because of its cost (less than $5 USD in medium quantities at time of writing), integrated memory (128MB), and relatively full Linux support.

Because this CPU is so 'small', it lends itself to more deeply embedded projects. When it is used in this manner, however, the relatively 'heavy' nature of Linux can become an issue, as it affects build size, boot time and so on.

In the instructions below we're building a completely open source system from a combination of U-Boot, Linux kernel & Buildroot. This is all based on mainline upstream sources, which gives a huge amount of assurance about on-going support and the quality of development. The configuration of these pieces has been adjusted to minimise size (and functionality to some degree as a consequence) to demonstrate a plausible minimal Linux system as a starting point for building something. The focus is on boot speed.

Timing breakdown

The unit boots in the following rough phases/durations:

  1. 250ms: Loading U-Boot from the SD card, initialising DDR, running U-Boot
  2. 600ms: Loading Linux and root filesystem from the SD card and starting the Linux kernel
  3. 400ms: Linux kernel initialising all drivers and mounting the initramfs root filesystem
  4. 200ms: Userspace execution
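As a quick sanity check, the rough phase durations above can be added up to confirm that they account for the 1.5 second figure. This is only a minimal sketch: the phase labels are shorthand introduced here, and the durations are the approximate values from the list.

```python
# Rough boot phases and durations in milliseconds, copied from the list above.
# The dictionary keys are shorthand labels, not names used by any tool.
PHASES_MS = {
    "U-Boot load and DDR init": 250,
    "Load kernel + rootfs from SD": 600,
    "Kernel init and initramfs mount": 400,
    "Userspace": 200,
}


def total_ms(phases):
    """Sum the phase durations in milliseconds."""
    return sum(phases.values())


def shares(phases):
    """Return each phase's share of the total boot time as a percentage."""
    total = total_ms(phases)
    return {name: 100.0 * ms / total for name, ms in phases.items()}


if __name__ == "__main__":
    print(f"total: {total_ms(PHASES_MS)} ms")
    for name, pct in shares(PHASES_MS).items():
        print(f"{pct:5.1f}%  {name}")
```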

The biggest contributor is reading content from the SD card. Unfortunately, the U-Boot driver for the SunXI MMC device inside the AllWinner T113-S3 does not give us great performance. We are currently seeing ~6MiB per second, on a card that should be capable of many times that. If further improvements were needed, this would be the area to optimise.
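To get a feel for how much a faster MMC driver could save, the load time can be estimated from the image size and the read rate. The sketch below uses the image size and the 5.7 MiB/s rate that U-Boot reports in the timed boot log further down; the 10 and 20 MiB/s rates are hypothetical targets chosen for illustration, not measurements.

```python
MIB = 1024 * 1024


def read_time_ms(size_bytes, rate_mib_per_s):
    """Estimate how long a sequential read takes, in milliseconds."""
    return 1000.0 * size_bytes / (rate_mib_per_s * MIB)


# Image size reported by U-Boot in the timed boot log.
IMAGE_BYTES = 3482343

if __name__ == "__main__":
    # 5.7 MiB/s is the measured rate; 10 and 20 MiB/s are hypothetical.
    for rate in (5.7, 10.0, 20.0):
        print(f"{rate:5.1f} MiB/s -> {read_time_ms(IMAGE_BYTES, rate):6.1f} ms")
```

Under these assumptions, a driver that reached 20 MiB/s would cut roughly 400ms from the boot.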

Boot speed tips & tricks

There isn't anything particularly special about the image that is constructed here - the fundamental C source code has not been changed in any way. What we have done is tweak the configurations to remove features we don't want, and adjust some default values. Here are the main areas that have been adjusted:

Baudrates
Embedded Linux systems typically use a serial port as their primary console for development. This is normally not exposed on the final product. For various historic reasons, the default for this is set to 115200 baud. This is horribly slow. Modern systems have no problems with much higher baudrates on UARTs, some up to 12 Megabaud. For our purposes, 1.5 Megabaud is a good balance between speed, compatibility and standardisation.
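The cost of a slow console is easy to estimate. The sketch below assumes the usual 8N1 framing (one start bit, eight data bits, one stop bit, so ten bits on the wire per byte) and a hypothetical 4096 bytes of boot output; neither figure is taken from the build itself.

```python
def tx_time_ms(num_bytes, baud, bits_per_byte=10):
    """Time to transmit num_bytes over a UART, in milliseconds.

    bits_per_byte defaults to 10, which assumes 8N1 framing:
    1 start bit + 8 data bits + 1 stop bit.
    """
    return 1000.0 * num_bytes * bits_per_byte / baud


# Hypothetical amount of console output produced during boot.
BOOT_OUTPUT_BYTES = 4096

if __name__ == "__main__":
    for baud in (115200, 1500000):
        ms = tx_time_ms(BOOT_OUTPUT_BYTES, baud)
        print(f"{baud:>8} baud: {ms:6.1f} ms")
```

With these assumptions the same output takes about 356ms at 115200 baud and about 27ms at 1.5 Megabaud, which is a significant slice of a 1.5 second boot whenever the code printing to the console waits for the UART to drain.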

We also add the quiet parameter to the Linux kernel bootargs to reduce its console output during boot. See the ArchLinux wiki for more details. Full kernel logs are still available via dmesg if required.

Configuration
U-Boot by default uses a 3 second boot delay to give you the opportunity to stop it at the prompt to run ad hoc boot commands. While retaining the ability to do this is helpful, there is no need to have a delay at all. U-Boot will look for characters sent in the time between power on and the completion of the board initialisation. Any character sent then will interrupt the default boot, even if the boot delay value is set to 0. On the Mango Pi MQ this still gives us a few hundred milliseconds, which is sufficient.

This has been adjusted by setting CONFIG_BOOTDELAY=0 in the U-Boot configuration.
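When a setting like this matters for boot time, it can be worth checking the generated configuration after a build, in case a defconfig change silently reverts it. The following is a minimal sketch of such a check. The helper names and the sample text are hypothetical; only CONFIG_BOOTDELAY=0 comes from the description above.

```python
def parse_kconfig(text):
    """Parse 'CONFIG_FOO=value' lines from Kconfig-style .config text.

    Comment lines (including '# CONFIG_FOO is not set') are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value.strip('"')
    return options


def check_options(options, expected):
    """Return (name, wanted, actual) for every option that does not match."""
    return [
        (name, wanted, options.get(name))
        for name, wanted in expected.items()
        if options.get(name) != wanted
    ]


# Hypothetical excerpt of a U-Boot .config, for illustration only.
SAMPLE_CONFIG = """\
# Automatically generated file; DO NOT EDIT.
CONFIG_BOOTDELAY=0
"""

if __name__ == "__main__":
    problems = check_options(parse_kconfig(SAMPLE_CONFIG), {"CONFIG_BOOTDELAY": "0"})
    print("OK" if not problems else problems)
```

In practice the text would be read from the .config file in the U-Boot build directory, whose location depends on the build setup.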

Storage access
This covers two things - don't hit the storage more frequently than you need, and don't store more than you need. To avoid hitting the storage too frequently we have dropped the U-Boot programmable environment. So you cannot make non-volatile changes to U-Boot without rebuilding U-Boot itself. This can actually be a benefit to reliability on embedded systems as any issues in the U-Boot environment can render the unit unbootable. So removing the ability to change it reduces some risk. It also means we don't have to look to storage to find a U-Boot environment file.

We have combined the kernel, device tree and filesystem into a single image which is loaded into memory in a continuous read.

We have reduced the functionality in the filesystem & kernel to the bare minimum, thus reducing image size (3.5MB at this stage).
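To see where those bytes go, the component sizes that U-Boot prints while loading the combined image (see the timed boot log) can be compared against the total read size. This is a rough sketch using only figures from that log; treating the remainder as image-format overhead is an assumption.

```python
# Sizes in bytes, as reported by U-Boot in the timed boot log.
FIT_TOTAL_BYTES = 3482343
COMPONENTS = {"kernel": 2969640, "ramdisk": 493041, "fdt": 17823}


def fit_breakdown(total, components):
    """Return (overhead_bytes, {name: percent_of_total})."""
    payload = sum(components.values())
    percents = {name: 100.0 * size / total for name, size in components.items()}
    return total - payload, percents


overhead, percents = fit_breakdown(FIT_TOTAL_BYTES, COMPONENTS)

if __name__ == "__main__":
    print(f"unaccounted (assumed header/metadata): {overhead} bytes")
    for name, pct in percents.items():
        print(f"{name:8s} {pct:5.1f}%")
```

The kernel accounts for about 85% of the image, so it is the component where further size reductions would pay off most.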

Reduced functionality
We have removed as much as possible from the Linux kernel & Buildroot filesystem to bring the image size down.

In Buildroot we are using a statically built, musl-based busybox executable. This removes the need for libc.so by linking it directly into busybox. However, if more binaries are added, each one would carry its own copy of libc, and eventually static linking would no longer be worthwhile.

In Linux we have turned off module support, and removed various subsystems that we're not using (wifi, sound etc...)

Other options

If the volatile (RAM-based) root filesystem makes development awkward, moving over to using the EXT4 partition directly would be an option. This could also have some benefits to boot speed, as only the needed files would be accessed during boot.

Build Instructions

Pre-built binaries can be obtained from AndreRenaud/buildroot-mangopi-mini/releases.

  1. Clone the Buildroot repository
    $ git clone https://github.com/AndreRenaud/buildroot-mangopi-mini.git
  2. Build the image
    $ cd buildroot-mangopi-mini
    buildroot-mangopi-mini$ make mangopi_mini_defconfig
    buildroot-mangopi-mini$ make
    ... [verbose output removed] ...
    buildroot-mangopi-mini$ ls -l output/images/
    total 18308
    -rwxr-xr-x 1 andre andre     17823 Sep 17 09:46 mango-mini.dtb
    -rw-r--r-- 1 andre andre   3482347 Sep 17 09:47 mango-mini.ub
    -rw-r--r-- 1 andre andre   1039872 Sep 17 09:47 rootfs.cpio
    -rw-r--r-- 1 andre andre    493039 Sep 17 09:47 rootfs.cpio.zst
    -rw-r--r-- 1 andre andre 268435456 Sep 17 09:47 rootfs.ext4
    -rw-r--r-- 1 andre andre   1556480 Sep 17 09:47 rootfs.tar
    -rw-r--r-- 1 andre andre 269484032 Sep 17 09:47 sdcard.img
    -rw-r--r-- 1 andre andre    521812 Sep 17 09:46 u-boot.bin
    -rw-r--r-- 1 andre andre    554644 Sep 17 09:46 u-boot-sunxi-with-spl.bin
    -rw-r--r-- 1 andre andre   2969648 Sep 17 09:46 zImage
                        
  3. Flash a blank microSD card. Note: be careful that you have the correct device. Use Balena Etcher if you're not sure
    $ sudo dd if=output/images/sdcard.img of=/dev/sda bs=1M
  4. Insert the uSD card into the slot on the MangoPI MQ, and power it on via the USB OTG header
  5. A new USB serial device should now show up on your computer. Connect to it using a serial program (picocom, minicom, putty etc...) at 115200 baud, and you should see a login prompt (username: "root", password: "" - blank)
    $ picocom -b 115200 /dev/ttyACM0
    picocom v3.1
    ... [verbose output removed] ...
    
    Type [C-a] [C-h] to see available commands
    Terminal ready
    
    Welcome to Buildroot
    buildroot login:
Timed Boot Log

This boot log is obtained by monitoring the hardware serial port of the Mango Pi on P8. As we need the Linux kernel to initialise the USB peripheral, we cannot get the full boot console over the USB port. This boot log is generated using GrabSerial. The first column shows the absolute time from boot at which the line of text started. The second column shows the delta from the previous line. All values are in floating point seconds.
[0.000002 0.000002]
[0.000218 0.000216] U-Boot SPL 2024.10-rc1 (Sep 17 2024 - 11:21:52 +1200)
[0.000802 0.000584] DRAM: 128 MiB
[0.003019 0.002217] Trying to boot from MMC1
[0.199349 0.196330]
[0.199475 0.000126]
[0.199490 0.000015] U-Boot 2024.10-rc1 (Sep 17 2024 - 11:21:52 +1200) Allwinner Technology
[0.200530 0.001040]
[0.200546 0.000016] CPU:   Allwinner RModel: MangoPi MQ-R-T113
[0.201054 0.000508] DRAM:
[0.224077 0.023023] Core:  39 devices, 17 uclasses, devicetree: separate
[0.224673 0.000596] WDT:   Not starting watchdog@20500a0
[0.225035 0.000362] MMC:   mmc@4020000: 0, mmc@4021000: 1
[0.231221 0.006186] Loading Environment from nowhere... OK
[0.231650 0.000429] In:    serial@2500c00
[0.231819 0.000169] Out:   serial@2500c00
[0.231979 0.000160] Err:   serial@2500c00
[0.233635 0.001656] Net:   No ethernet found.
[0.235225 0.001590] Hit any key to stop autoboot:  0
[0.837554 0.602329] 3482343 bytes read in 578 ms (5.7 MiB/s)
[0.838717 0.001163] ## Loading kernel from FIT Image at 45000000 ...
[0.839962 0.001245]    Using 'config' configuration
[0.840907 0.000945]    Trying 'kernel' kernel subimage
[0.841317 0.000410]      Description:  Linux kernel
[0.841675 0.000358]      Type:         Kernel Image
[0.842026 0.000351]      Compression:  uncompressed
[0.842383 0.000357]      Data Start:   0x450000c8
[0.842728 0.000345]      Data Size:    2969640 Bytes = 2.8 MiB
[0.843210 0.000482]      Architecture: ARM
[0.843477 0.000267]      OS:           Linux
[0.843765 0.000288]      Load Address: 0x42000000
[0.844101 0.000336]      Entry Point:  0x42000000
[0.844451 0.000350]    Verifying Hash Integrity ... OK
[0.844864 0.000413] ## Loading ramdisk from FIT Image at 45000000 ...
[0.845434 0.000570]    Using 'config' configuration
[0.845811 0.000377]    Trying 'ramdisk' ramdisk subimage
[0.846217 0.000406]      Description:  buildroot
[0.846515 0.000298]      Type:         RAMDisk Image
[0.846792 0.000277]      Compression:  uncompressed
[0.847060 0.000268]      Data Start:   0x452d97a8
[0.847311 0.000251]      Data Size:    493041 Bytes = 481.5 KiB
[0.847678 0.000367]      Architecture: ARM
[0.847873 0.000195]      OS:           Linux
[0.848084 0.000211]      Load Address: unavailable
[0.848452 0.000368]      Entry Point:  unavailable
[0.848744 0.000292]    Verifying Hash Integrity ... OK
[0.849067 0.000323] ## Loading fdt from FIT Image at 45000000 ...
[0.849483 0.000416]    Using 'config' configuration
[0.849766 0.000283]    Trying 'fdt' fdt subimage
[0.850030 0.000264]      Description:  Flattened Device Tree blob
[0.850456 0.000426]      Type:         Flat Device Tree
[0.850762 0.000306]      Compression:  uncompressed
[0.851083 0.000321]      Data Start:   0x452d519c
[0.851346 0.000263]      Data Size:    17823 Bytes = 17.4 KiB
[0.851700 0.000354]      Architecture: ARM
[0.851899 0.000199]    Verifying Hash Integrity ... OK
[0.852197 0.000298]    Booting using the fdt blob at 0x452d519c
[0.852549 0.000352]    Loading Kernel Image to 42000000
[0.852805 0.000256]    Loading Ramdisk to 47cf2000, end 47d6a5f1 ... OK
[0.853170 0.000365]    Loading Device Tree to 47cea000, end 47cf159e ... OK
[0.860799 0.007629]
[0.860817 0.000018] Starting kernel ...
[0.860983 0.000166]
[1.188615 0.327632] [    0.001454] /cpus/cpu@0 missing clock-frequency property
[1.209622 0.021007] [    0.001513] /cpus/cpu@1 missing lock-frequency property
[1.364434 0.154812] Saving 256 bits of non-creditable seed for next boot
[1.369073 0.004639] Starting syslogd: OK
[1.379926 0.010853] Starting klogd: OK
[1.383684 0.003758] Running sysctl: OK
[1.398598 0.014914] Starting network: OK
[1.416731 0.018133] Starting crond: OK
[1.521718 0.104987]
[1.521754 0.000036] Welcome to Buildroot
[1.521923 0.000169] buildroot login:
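A log in this format is easy to mine for the slowest steps. The sketch below parses lines of the form `[absolute delta] text` and lists the largest deltas. The function names are introduced here for illustration, and the sample is a handful of lines copied from the log above.

```python
import re

# Matches "[<absolute> <delta>] <text>", where the text may be empty.
LINE_RE = re.compile(r"^\[(\d+\.\d+) (\d+\.\d+)\] ?(.*)$")


def parse_timed_log(text):
    """Return a list of (absolute_s, delta_s, message) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((float(match.group(1)), float(match.group(2)), match.group(3)))
    return entries


def slowest_gaps(entries, count=3):
    """Return the entries with the largest delta, biggest first.

    A line's delta is the time since the previous line started, so a large
    delta means the work done before this line was printed was slow.
    """
    return sorted(entries, key=lambda entry: entry[1], reverse=True)[:count]


SAMPLE_LOG = """\
[0.235225 0.001590] Hit any key to stop autoboot:  0
[0.837554 0.602329] 3482343 bytes read in 578 ms (5.7 MiB/s)
[0.860817 0.000018] Starting kernel ...
[0.860983 0.000166]
[1.188615 0.327632] [    0.001454] /cpus/cpu@0 missing clock-frequency property
"""

if __name__ == "__main__":
    for absolute, delta, message in slowest_gaps(parse_timed_log(SAMPLE_LOG), 2):
        print(f"{delta * 1000:7.1f} ms before: {message}")
```

On the sample this picks out the SD card read (about 602ms) and the gap between "Starting kernel ..." and the first kernel message (about 328ms).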
        
Benchmarks & Hardware Testing

Mhz

# mhz
count=394821 us50=19992 us250=98404 diff=78412 cpu_MHz=1007.042

Dhrystone

# dhrystone 50000000
Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Execution starts, 50000000 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob:            5
        should be:   5
Bool_Glob:           1
        should be:   1
Ch_1_Glob:           A
        should be:   A
Ch_2_Glob:           B
        should be:   B
Arr_1_Glob[8]:       7
        should be:   7
Arr_2_Glob[8][7]:    50000010
        should be:   Number_Of_Runs + 10
Ptr_Glob->
  Ptr_Comp:          -1224957888
        should be:   (implementation-dependent)
  Discr:             0
        should be:   0
  Enum_Comp:         2
        should be:   2
  Int_Comp:          17
        should be:   17
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
  Ptr_Comp:          -1224957888
        should be:   (implementation-dependent), same as above
  Discr:             0
        should be:   0
  Enum_Comp:         1
        should be:   1
  Int_Comp:          18
        should be:   18
  Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
        should be:   DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc:           5
        should be:   5
Int_2_Loc:           13
        should be:   13
Int_3_Loc:           7
        should be:   7
Enum_Loc:            1
        should be:   1
Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
        should be:   DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
        should be:   DHRYSTONE PROGRAM, 2'ND STRING
          
Microseconds for one run through Dhrystone:    0.3
Dhrystones per Second:                      3302510.0 
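Dhrystone results are often quoted as DMIPS, obtained by dividing Dhrystones per second by 1757, the score conventionally attributed to the VAX 11/780 reference machine. The sketch below applies that convention to the figures above, using the clock frequency reported by the mhz run; the function names are illustrative.

```python
# Conventional reference score: 1757 Dhrystones/s corresponds to 1 DMIPS.
VAX_11_780_DHRYSTONES_PER_SEC = 1757.0


def dmips(dhrystones_per_sec):
    """Convert Dhrystones per second to DMIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec, cpu_mhz):
    """Normalise the DMIPS figure by the CPU clock frequency."""
    return dmips(dhrystones_per_sec) / cpu_mhz


if __name__ == "__main__":
    score = 3302510.0   # Dhrystones per second, from the run above
    clock = 1007.042    # cpu_MHz, from the mhz run above
    print(f"{dmips(score):.1f} DMIPS, {dmips_per_mhz(score, clock):.2f} DMIPS/MHz")
```

For this run that works out to roughly 1880 DMIPS, or about 1.87 DMIPS/MHz.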

Whetstone

# whetstone 500000

Loops: 500000, Iterations: 1, Duration: 24 sec.
C Converted Double Precision Whetstones: 2083.3 MIPS
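The reported figure is consistent with the calculation used by the commonly distributed C conversion of Whetstone, which counts 100 x loops x iterations thousand instructions and divides by the duration. Whether this particular build uses exactly that formula is an assumption, but it reproduces the number above. The sketch also shows how coarse the result is when the duration is only known to the nearest second.

```python
def whetstone_mips(loops, iterations, duration_s):
    """MIPS figure, assuming 100 * loops * iterations thousand instructions in total."""
    kips = 100.0 * loops * iterations / duration_s
    return kips / 1000.0


if __name__ == "__main__":
    # 24 seconds is the reported duration; 23 and 25 show the effect of a
    # one second measurement error on the result.
    for seconds in (23, 24, 25):
        print(f"{seconds} s -> {whetstone_mips(500000, 1, seconds):.1f} MIPS")
```

A one second difference in the measured duration moves the result by around 4%, so a longer run would be needed for a finer-grained comparison.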

RamSMP

Note: These tests take quite a while to run. Also, due to the limited RAM available on this CPU, the kernel out-of-memory (OOM) killer will kill these tests if they are run at full memory usage, so we have reduced the memory usage in some tests.
# ramsmp -b 1
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  8Gb per pass mode, 2 processes
  
  INTEGER & WRITING         1 Kb block: 10015.62 MB/s
  INTEGER & WRITING         2 Kb block: 9667.79 MB/s
  INTEGER & WRITING         4 Kb block: 6742.15 MB/s
  INTEGER & WRITING         8 Kb block: 7311.32 MB/s
  INTEGER & WRITING        16 Kb block: 7025.77 MB/s
  INTEGER & WRITING        32 Kb block: 7796.73 MB/s
  INTEGER & WRITING        64 Kb block: 3464.04 MB/s
  INTEGER & WRITING       128 Kb block: 3688.21 MB/s
  INTEGER & WRITING       256 Kb block: 3245.94 MB/s
  INTEGER & WRITING       512 Kb block: 3031.65 MB/s
  INTEGER & WRITING      1024 Kb block: 2927.86 MB/s
  INTEGER & WRITING      2048 Kb block: 2905.54 MB/s
  INTEGER & WRITING      4096 Kb block: 2892.17 MB/s
  INTEGER & WRITING      8192 Kb block: 2885.02 MB/s
  INTEGER & WRITING     16384 Kb block: 2874.86 MB/s
  INTEGER & WRITING     32768 Kb block: 2858.14 MB/s

  # ramsmp -b 2
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  8Gb per pass mode, 2 processes
  
  INTEGER & READING         1 Kb block: 7583.74 MB/s
  INTEGER & READING         2 Kb block: 7639.73 MB/s
  INTEGER & READING         4 Kb block: 7668.49 MB/s
  INTEGER & READING         8 Kb block: 7681.87 MB/s
  INTEGER & READING        16 Kb block: 7665.73 MB/s
  INTEGER & READING        32 Kb block: 6562.60 MB/s
  INTEGER & READING        64 Kb block: 5887.65 MB/s
  INTEGER & READING       128 Kb block: 4137.91 MB/s
  INTEGER & READING       256 Kb block: 2374.20 MB/s
  INTEGER & READING       512 Kb block: 2507.72 MB/s
  INTEGER & READING      1024 Kb block: 2589.00 MB/s
  INTEGER & READING      2048 Kb block: 2544.78 MB/s
  INTEGER & READING      4096 Kb block: 2529.61 MB/s
  INTEGER & READING      8192 Kb block: 2528.81 MB/s
  INTEGER & READING     16384 Kb block: 2525.78 MB/s
  INTEGER & READING     32768 Kb block: 2523.67 MB/s

  # ramsmp -b 3 -m 16 -g 1
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  1Gb per pass mode, 2 processes
  
  INTEGER   Copy:      1308.69 MB/s
  INTEGER   Scale:     1632.49 MB/s
  INTEGER   Add:       956.67 MB/s
  INTEGER   Triad:     914.97 MB/s
  ---
  INTEGER   AVERAGE:   1203.21 MB/s
  
  # ramsmp -b 4 -m 16 -g 1
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  1Gb per pass mode, 2 processes
  
  FL-POINT & WRITING        1 Kb block: 10963.46 MB/s
  FL-POINT & WRITING        2 Kb block: 10784.90 MB/s
  FL-POINT & WRITING        4 Kb block: 7000.56 MB/s
  FL-POINT & WRITING        8 Kb block: 8262.41 MB/s
  FL-POINT & WRITING       16 Kb block: 7400.96 MB/s
  FL-POINT & WRITING       32 Kb block: 7745.64 MB/s
  FL-POINT & WRITING       64 Kb block: 3309.73 MB/s
  FL-POINT & WRITING      128 Kb block: 3616.63 MB/s
  FL-POINT & WRITING      256 Kb block: 3220.54 MB/s
  FL-POINT & WRITING      512 Kb block: 3073.26 MB/s
  FL-POINT & WRITING     1024 Kb block: 2971.77 MB/s
  FL-POINT & WRITING     2048 Kb block: 2961.22 MB/s
  FL-POINT & WRITING     4096 Kb block: 2934.71 MB/s
  FL-POINT & WRITING     8192 Kb block: 2914.26 MB/s
  FL-POINT & WRITING    16384 Kb block: 2868.20 MB/s

  # ramsmp -b 5 -m 16 -g 1
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  1Gb per pass mode, 2 processes
  
  FL-POINT & READING        1 Kb block: 6505.88 MB/s
  FL-POINT & READING        2 Kb block: 6536.87 MB/s
  FL-POINT & READING        4 Kb block: 6552.30 MB/s
  FL-POINT & READING        8 Kb block: 6552.54 MB/s
  FL-POINT & READING       16 Kb block: 6537.00 MB/s
  FL-POINT & READING       32 Kb block: 5605.81 MB/s
  FL-POINT & READING       64 Kb block: 3599.00 MB/s
  FL-POINT & READING      128 Kb block: 2706.26 MB/s
  FL-POINT & READING      256 Kb block: 2836.77 MB/s
  FL-POINT & READING      512 Kb block: 2459.71 MB/s
  FL-POINT & READING     1024 Kb block: 2506.30 MB/s
  FL-POINT & READING     2048 Kb block: 2490.05 MB/s
  FL-POINT & READING     4096 Kb block: 2498.23 MB/s
  FL-POINT & READING     8192 Kb block: 2514.75 MB/s
  FL-POINT & READING    16384 Kb block: 2535.22 MB/s

  # ramsmp -b 6 -m 16 -g 1
  RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
  
  1Gb per pass mode, 2 processes
  
  FL-POINT  Copy:      1022.98 MB/s
  FL-POINT  Scale:     858.64 MB/s
  FL-POINT  Add:       1579.09 MB/s
  FL-POINT  Triad:     929.05 MB/s
  ---
  FL-POINT  AVERAGE:   1097.44 MB/s
  

Wifi

This assumes that the iperf3 server is running elsewhere on the network, and that the MangoPi wifi has been configured appropriately.
# iperf3 -c 192.168.2.140
Connecting to host 192.168.2.140, port 5201
[  5] local 192.168.2.38 port 43774 connected to 192.168.2.140 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.12 MBytes  17.8 Mbits/sec    0    137 KBytes
[  5]   1.00-2.00   sec  4.12 MBytes  34.6 Mbits/sec    0    303 KBytes
[  5]   2.00-3.00   sec  4.38 MBytes  36.7 Mbits/sec    0    342 KBytes
[  5]   3.00-4.00   sec  4.50 MBytes  37.7 Mbits/sec    0    342 KBytes
[  5]   4.00-5.00   sec  4.00 MBytes  33.6 Mbits/sec    0    342 KBytes
[  5]   5.00-6.00   sec  4.38 MBytes  36.7 Mbits/sec    0    361 KBytes
[  5]   6.00-7.00   sec  4.50 MBytes  37.7 Mbits/sec    0    361 KBytes
[  5]   7.00-8.00   sec  4.38 MBytes  36.7 Mbits/sec    0    380 KBytes
[  5]   8.00-9.00   sec  4.00 MBytes  33.6 Mbits/sec    0    380 KBytes
[  5]   9.00-10.00  sec  4.12 MBytes  34.6 Mbits/sec    0    380 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  40.5 MBytes  34.0 Mbits/sec    0             sender
[  5]   0.00-10.09  sec  40.0 MBytes  33.2 Mbits/sec                  receiver
          
iperf Done.
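When reading these figures, note that iperf3 reports transfer sizes in binary units (1 MByte = 1048576 bytes) but bitrates in decimal units (1 Mbit = 1000000 bits), so the bitrate is slightly more than eight times the MBytes per second. A small sketch of the conversion, with a function name chosen here for illustration:

```python
def mbytes_to_mbits_per_s(mbytes, seconds):
    """Convert an iperf3 transfer size (binary MBytes) to a bitrate (decimal Mbits/s)."""
    bits = mbytes * 1024 * 1024 * 8
    return bits / seconds / 1_000_000


if __name__ == "__main__":
    # Sender summary line above: 40.5 MBytes in 10.00 seconds.
    print(f"{mbytes_to_mbits_per_s(40.5, 10.0):.1f} Mbits/sec")
    # First interval above: 2.12 MBytes in 1.00 second.
    print(f"{mbytes_to_mbits_per_s(2.12, 1.0):.1f} Mbits/sec")
```

Both values agree with the bitrates iperf3 printed for those lines (34.0 and 17.8 Mbits/sec).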