
    Mobile LLM on Android - Chapter 1

    Running large language models (LLMs) on mobile devices is exciting, and it has great potential to become the next hot spot in systems research. In this blog series, starting with my first-ever post, I will share how to run LLMs on mobile devices and, more importantly, empirical evidence to justify a bunch of ‘why’s and ‘how’s.

    Introduction

    First, let’s clarify what we care about and what we don’t care about in this series of posts.

    A. This blog series is research oriented.

    We won’t discuss how to implement a beautiful GUI for mobile LLM applications. We won’t discuss which solution works best for a certain AGI business model. More than 99% of this work will be done on the command line.

    B. This blog series is system-research oriented.

    We care about the basics of running any new application on mobile from a system researcher’s point of view. What is the execution model? What is the performance in terms of latency, throughput, power draw, energy, and thermals? How can we improve that performance?

    We won’t discuss the quality of the LLM’s responses (i.e., accuracy); I’m sure the big tech companies will keep updating model parameters regularly.

    C. This blog series is mobile-system-research oriented.

    Here, mobile means smartphone. We won’t discuss (edge) single-board computers like the NVIDIA Jetson. Server-side topics will be covered in the future.

    The LLM engine is llama.cpp, an extremely popular project on GitHub (70.7K stars as of mid-January 2025). llama.cpp is written mostly in C/C++.

    This page is the first chapter of the series. It focuses on building llama.cpp and deploying it to Android. The following chapters will discuss measurements and optimizations.

    Stop-and-think

    Skip this section if you absolutely know what you are doing.

    Q1: Why llama.cpp?

    Objectively, llama.cpp is simple to read and debug. Its core tensor library (ggml.c) and model definitions (including layer definitions, operators, and schedulers) are written in C/C++. Everybody can read C/C++, right? :-)

    Q2: Why android (vs. iOS)?

    LLMs eat a loooot of DRAM. Android phones are usually cheaper than iPhones when we have a constraint on the minimum DRAM size.
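
    As a rough back-of-the-envelope estimate (assuming 4-bit weight quantization and ignoring the KV cache and runtime buffers): a 7B-parameter model needs about 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB of DRAM for the weights alone, which is already a large fraction of a typical 8 GB phone.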

    Q3: Why mobile GPU?

    Be aware that the mobile GPU isn’t always faster than the CPU. However, the hope is that the mobile GPU is more energy-efficient than the CPU cores.

    In my experience, running LLM inference on the multi-core mobile CPU is often faster than on the GPU, but it triggers power throttling much sooner. Mobile NPUs are probably even more energy-efficient and thermal-friendly; those are left as future work.

    Q4: Why don’t I see anything related to OpenCL in the latest llama.cpp codebase?

    Unfortunately, OpenCL support has been removed from the llama.cpp codebase. The following discussion assumes we start from tag b2202, which still has OpenCL support.

    1. Preparation

    The following steps have been tested with:

    • Pixel 7 / Pixel 7 Pro phones (no need to root)
    • Ubuntu WSL on a Windows 11 machine
    • Android NDK r27c

    The LLM engine will be built in WSL and deployed to the Pixel phone via adb.

    You’ll need to collect three lib files from the phone via adb (see the sketch after the list). Put them in a folder such as <project_home>/3rd_party/CLlib/

    1. /vendor/lib64/egl/libGLES_mali.so
    2. /vendor/lib/libOpenCL-pixel.so
    3. /vendor/lib/libOpenCL.so
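
    A minimal sketch of pulling these files with adb (run from <project_home>; from WSL you can also invoke adb.exe directly, as in the deployment step later):

    # pull the vendor OpenCL / Mali libraries from the phone
    mkdir -p 3rd_party/CLlib
    adb pull /vendor/lib64/egl/libGLES_mali.so 3rd_party/CLlib/
    adb pull /vendor/lib/libOpenCL-pixel.so 3rd_party/CLlib/
    adb pull /vendor/lib/libOpenCL.so 3rd_party/CLlib/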

    You’ll need to collect the OpenCL headers either from the phone or from GitHub. The simplest way is to copy the entire CL folder from the MNN project and put it in a folder such as <project_home>/3rd_party/include/CL
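
    For example (a sketch; the exact location of the CL folder inside the MNN repository may differ, so locate it first):

    git clone --depth 1 https://github.com/alibaba/MNN.git /tmp/MNN
    find /tmp/MNN -type d -name CL                     # locate the vendored OpenCL header folder
    cp -r /tmp/MNN/<path_found_above>/CL 3rd_party/include/   # adjust to the path printed above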

    Next, git clone the CLBlast library into <project_home>/CLBlast.

    Finally, git clone llama.cpp into <project_home>/llama.cpp and check out tag b2202.
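
    Concretely, the clone and checkout steps might look like this (using the upstream repository URLs):

    cd <project_home>
    git clone https://github.com/CNugteren/CLBlast.git CLBlast
    git clone https://github.com/ggerganov/llama.cpp.git llama.cpp
    cd llama.cpp && git checkout b2202 && cd ..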

    The resulting folder structure (with <project_home> named llama-cpp-home here) looks like:

    llama-cpp-home/
    ├─ 3rd_party/
    │  ├─ CLlib/
    │  ├─ include/
    │  │  ├─ CL/
    ├─ CLBlast/
    ├─ llama.cpp/
    

    2. Build

    First, build CLBlast. Go to CLBlast/ and execute the following commands.

    export ANDROID_NDK=/some_path_to/android-ndk-r27c 
    
    
    cmake \
      -DOPENCL_INCLUDE_DIRS=$(pwd)/../3rd_party/include \
      -DOPENCL_LIBRARIES=$(pwd)/../3rd_party/CLlib/libGLES_mali.so \
      -DBUILD_SHARED_LIBS=OFF \
      -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a \
      -DANDROID_PLATFORM=android-21 \
      -DCMAKE_C_FLAGS="-march=armv8a" \
      -DCMAKE_CXX_FLAGS="-march=armv8a" \
      -DCMAKE_INSTALL_PREFIX=$(pwd)/install \
      -DVERBOSE=OFF .
    
    
    make -j4
    
    
    make install
    

    Inside the CLBlast/install folder, you should see the libclblast libraries.

    CLBlast/
    ├─ install/
    │  ├─ bin/
    │  ├─ include/
    │  ├─ lib/
    │  │  ├─ libclblast.a       <--- this
    │  │  ├─ libclblast.so      <--- and this
    │  │  ├─ cmake/
    │  │  │  ├─ CLBlast/
    │  │  ├─ pkgconfig/
    ├─ .../
    
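    An optional sanity check (assuming the file utility is available in WSL) to confirm the library was cross-compiled for the phone rather than the host:

    file install/lib/libclblast.so    # should report an ELF 64-bit, ARM aarch64 shared object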

    Next, build llama.cpp. Go to llama.cpp/ and execute the following commands.

    export ANDROID_NDK=/some_path_to/android-ndk-r27c
    export CLBLAST_CMAKE=$(pwd)/../CLBlast/install/lib/cmake/CLBlast
    
    
    rm -rf build && mkdir build
    cd build
    
    
    cmake .. \
        -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
        -DANDROID_STL=c++_static \
        -DANDROID_ABI=arm64-v8a \
        -DANDROID_NATIVE_API_LEVEL=android-21 \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-march=armv8a" \
        -DCMAKE_CXX_FLAGS="-march=armv8a" \
        -DBUILD_SHARED_LIBS=OFF \
        -DLLAMA_NATIVE=OFF \
        -DLLAMA_BUILD_TESTS=OFF \
        -DLLAMA_CLBLAST=ON \
        -DCLBlast_DIR=$CLBLAST_CMAKE
    
    
    make -j4
    
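    If the build succeeds, an optional sanity check (back at the llama.cpp/ root, again assuming the file utility) to confirm the benchmark binary targets the phone:

    file build/bin/llama-bench    # should report an ELF 64-bit, ARM aarch64 binary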

    3. Deployment

    We will collect the libraries and executables and upload them to the phone.

    cd llama.cpp
    
    
    export INSTALL_DIR=llama-cpp-install
    mkdir ${INSTALL_DIR}
    
    
    cp build/bin/llama-bench ${INSTALL_DIR}
    cp ../CLBlast/install/lib/libclblast.so ${INSTALL_DIR}
    
    
    # WSL can directly call .exe programs on Windows
    adb.exe push --sync ${INSTALL_DIR} /data/local/tmp
    adb.exe shell chmod +x /data/local/tmp/${INSTALL_DIR}/*
    

    Search for GGUF-format models on Hugging Face and download one that fits in the phone’s DRAM, for example TinyLlama. Also transfer the model to /data/local/tmp/llama-cpp-install on the phone.
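
    For example (a sketch from the WSL side; the model file name is hypothetical):

    # push a downloaded GGUF model next to the binaries on the phone
    adb.exe push tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf /data/local/tmp/llama-cpp-install/
    # open an interactive shell on the phone; the commands in the next block run inside it
    adb.exe shell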

    export SCRIPT_DIR="/data/local/tmp/llama-cpp-install"
    cd $SCRIPT_DIR
    
    
    # point to the vendor OpenCL libraries and to libclblast.so in the current directory
    export LD_LIBRARY_PATH=/vendor/lib64:/vendor/lib64/egl:$(pwd):$LD_LIBRARY_PATH
    
    
    MODEL=some_model_name.gguf
    
    
    # -t:   CPU threads
    # -ngl: GPU layers, 99 == all
    # -m:   model path
    # -p:   prefill length
    # -n:   decode length
    # -r:   repeat for how many times
    # (inline comments after the trailing backslashes would break the line continuation)
    $(pwd)/llama-bench \
        -t 1 \
        -ngl 99 \
        -m $MODEL \
        -p 0 \
        -n 32 \
        -r 1
    

    4. Discussions

    Skip this section if you only care about making llama.cpp work.

    Q1: Pixel 7’s SoC is based on armv8.4a, so why don’t we specify armv8.4a (vs. armv8a) in CMake?

    Building llama.cpp with armv8.4a doesn’t work on Pixel 7. For some reason, the atomic_store call in ggml.c (line 17186) is compiled to an stlur instruction that Pixel 7 doesn’t support. If you switch to armv8a, the same call compiles to an stlr instruction (notice: no ‘u’ in the middle), which works.

    [Figure: ASM of the line that triggers the illegal instruction error when compiled with armv8.4a.]

    [Figure: ASM of the same line, which works when compiled with armv8a.]
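
    To inspect this yourself, here is a sketch using the NDK’s llvm-objdump (the object file path is illustrative and depends on your CMake layout):

    # disassemble the compiled ggml object file and look for stlur (armv8.4a build)
    # vs. stlr (armv8a build) around the atomic store
    $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-objdump -d \
        build/CMakeFiles/ggml.dir/ggml.c.o | grep -E 'stlur|stlr'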