Mobile LLM on Android - Chapter 1
Running large language models (LLMs) on mobile is amazing, and it has great potential to be the next hot spot in the systems research field. In this blog series (my first-ever blog posts), I will share how to run LLMs on mobile devices and, more importantly, empirical evidence to justify a bunch of ‘why’s and ‘how’s.
Introduction
First, let’s clarify what we care about and what we don’t care about in this series of posts.
A. This blog series is research oriented.
We won’t discuss how to implement a beautiful GUI for mobile LLM applications. We won’t discuss which solution works best for a certain AGI business model. More than 99% of this work will be done on the command line.
B. This blog series is system-research oriented.
We care about the basics of running any new application on mobile from a system researcher’s point of view. What is the execution model? What is the performance in terms of latency, throughput, power draw, energy, and thermals? How can we improve that performance?
We won’t discuss the quality of an LLM’s responses (i.e., accuracy); I’m sure the big tech companies will keep updating model parameters regularly.
C. This blog series is mobile-system-research oriented.
Here, mobile means smartphone. We won’t discuss (edge) single-board computers like the NVIDIA Jetson. Server-side topics will be covered in the future.
The LLM engine is llama.cpp, an extremely popular project on GitHub (70.7K stars as of mid-January 2025). llama.cpp is mostly written in C/C++.
This page is the first chapter of the series. It focuses on building llama.cpp and deploying it to Android. The following chapters will discuss measurements and optimizations.
Stop-and-think
Skip this section if you absolutely know what you are doing.
Q1: Why llama.cpp?
Objectively, llama.cpp is simple to read and debug. Its core tensor library (ggml.c) and model definitions (including layer definitions, operators, and schedulers) are written in C/C++. Everyone can read C/C++, right? :-)
Q2: Why Android (vs. iOS)?
LLMs eat a loooot of DRAM. Android phones are usually cheaper than iPhones when we have a constraint on minimum DRAM size.
Q3: Why mobile GPU?
Be aware that the mobile GPU isn’t always faster than the CPU. However, the hope is that the mobile GPU is more energy-efficient than the CPU.
From my experience, running LLM inference on multi-core mobile CPUs is usually faster than on the GPU, but it triggers power throttling much sooner. Mobile NPUs are probably even more energy-efficient and thermal-friendly; those are future work.
Q4: Why don’t I see anything related to OpenCL in the latest llama.cpp codebase?
Unfortunately, OpenCL support has been removed from the llama.cpp codebase. The following discussion assumes we start from tag b2202, which still has OpenCL support.
1. Preparation
The following steps have been tested with:
- Pixel 7 / Pixel 7 Pro phones (no need to root)
- Ubuntu WSL on a Windows 11 machine
- Android NDK r27c
The LLM engine will be built in WSL and deployed to the Pixel phone via adb.
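Before going further, it’s worth confirming that adb can actually see the phone (with USB debugging enabled). A quick check from WSL, which can call the Windows executable directly:
adb.exe devices
# the phone should be listed as "<serial> device"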
You’ll need to collect three lib files from the phone via adb. Put them in a folder such as <project_home>/3rd_party/CLlib/:
/vendor/lib64/egl/libGLES_mali.so
/vendor/lib/libOpenCL-pixel.so
/vendor/lib/libOpenCL.so
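For reference, one possible way to pull them, assuming you run these from <project_home> in WSL:
# pull the vendor OpenCL libraries into 3rd_party/CLlib/
mkdir -p 3rd_party/CLlib
adb.exe pull /vendor/lib64/egl/libGLES_mali.so 3rd_party/CLlib/
adb.exe pull /vendor/lib/libOpenCL-pixel.so 3rd_party/CLlib/
adb.exe pull /vendor/lib/libOpenCL.so 3rd_party/CLlib/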
You’ll need to collect OpenCL headers, either from the phone or from GitHub. The simplest way is to copy the entire CL folder from the MNN project and put it in a folder such as <project_home>/3rd_party/include/CL.
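If you’d rather not go through MNN, an alternative (not the route described above, and the header versions may differ slightly from what the phone’s driver expects) is the Khronos OpenCL-Headers repository on GitHub:
git clone https://github.com/KhronosGroup/OpenCL-Headers.git
mkdir -p 3rd_party/include
cp -r OpenCL-Headers/CL 3rd_party/include/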
Next, git clone the CLBlast library to <project_home>/CLBlast.
Finally, git clone llama.cpp to <project_home>/llama.cpp and check out tag b2202.
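If you prefer explicit commands, a minimal sketch (assuming the standard GitHub mirrors of both projects, starting from <project_home>):
git clone https://github.com/CNugteren/CLBlast.git CLBlast
git clone https://github.com/ggerganov/llama.cpp.git llama.cpp
cd llama.cpp
git checkout b2202   # last era of tags with OpenCL support
cd ..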
The folder structure looks like:
llama-cpp-home/
├─ 3rd_party/
│ ├─ CLlib/
│ ├─ include/
│ │ ├─ CL/
├─ CLBlast/
├─ llama.cpp/
2. Build
First, build CLBlast. Go to CLBlast/ and execute the following commands.
export ANDROID_NDK=/some_path_to/android-ndk-r27c
cmake \
-DOPENCL_INCLUDE_DIRS=$(pwd)/../3rd_party/include \
-DOPENCL_LIBRARIES=$(pwd)/../3rd_party/CLlib/libGLES_mali.so \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-21 \
-DCMAKE_C_FLAGS="-march=armv8a" \
-DCMAKE_CXX_FLAGS="-march=armv8a" \
-DCMAKE_INSTALL_PREFIX=$(pwd)/install \
-DVERBOSE=OFF .
make -j4
make install
Inside the CLBlast/install folder, you should see the libclblast libraries.
CLBlast/
├─ install/
│ ├─ bin/
│ ├─ include/
│ ├─ lib/
│ │ ├─ libclblast.a <--- this
│ │ ├─ libclblast.so <--- and this
│ │ ├─ cmake/
│ │ │ ├─ CLBlast/
│ │ ├─ pkgconfig/
├─ .../
Next, build llama.cpp. Go to llama.cpp/ and execute the following commands.
export ANDROID_NDK=/some_path_to/android-ndk-r27c
export CLBLAST_CMAKE=$(pwd)/../CLBlast/install/lib/cmake/CLBlast
rm -rf build && mkdir build
cd build
cmake .. \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_STL=c++_static \
-DANDROID_ABI=arm64-v8a \
-DANDROID_NATIVE_API_LEVEL=android-21 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-march=armv8a" \
-DCMAKE_CXX_FLAGS="-march=armv8a" \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_NATIVE=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_CLBLAST=ON \
-DCLBlast_DIR=$CLBLAST_CMAKE
make -j4
3. Deployment
We will collect the libraries and executables and upload them to the phone.
cd llama.cpp
export INSTALL_DIR=llama-cpp-install
mkdir ${INSTALL_DIR}
cp build/bin/llama-bench ${INSTALL_DIR}
cp ../CLBlast/install/lib/libclblast.so ${INSTALL_DIR}
# WSL can directly call exe on windows
adb.exe push --sync ${INSTALL_DIR} /data/local/tmp
adb.exe shell chmod +x /data/local/tmp/${INSTALL_DIR}/*
Search online for gguf-format models on Hugging Face. Download a model that fits in the phone’s DRAM, for example TinyLlama. Transfer the model to the /data/local/tmp/llama-cpp-install path on the phone as well.
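For example, assuming a hypothetical file name tinyllama-q4.gguf sitting in the current WSL directory:
# push the gguf model next to llama-bench on the phone
adb.exe push tinyllama-q4.gguf /data/local/tmp/llama-cpp-install/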
# run the following inside the adb shell on the phone
export SCRIPT_DIR="/data/local/tmp/llama-cpp-install"
cd $SCRIPT_DIR
# point to the vendor OpenCL libraries and the local libclblast.so
export LD_LIBRARY_PATH=/vendor/lib64:/vendor/lib64/egl:$(pwd):$LD_LIBRARY_PATH
MODEL=some_model_name.gguf
# -t: CPU threads; -ngl: GPU layers to offload (99 == all); -m: model path
# -p: prefill length; -n: decode length; -r: number of repetitions
$(pwd)/llama-bench \
 -t 1 \
 -ngl 99 \
 -m $MODEL \
 -p 0 \
 -n 32 \
 -r 1
4. Discussions
Skip this section if you only care about making llama.cpp work.
Q1: Pixel 7’s SoC is based on armv8.4a, so why don’t we specify armv8.4a (vs. armv8a) in cmake?
Building llama.cpp with armv8.4a doesn’t work on Pixel 7. For some reason, the atomic_store call in ggml.c, line 17186, is compiled to an stlur instruction, which Pixel 7 doesn’t support. If you switch to armv8a, the same call compiles to an stlr instruction (notice, no ‘u’ in the middle), which works.
ASM of the line that triggers the illegal instruction error when compiled with armv8.4a.
ASM of the same line, which works when compiled with armv8a.
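If you want to check this on your own build, one way (assuming the NDK’s llvm-objdump on a Linux host and the statically linked llama-bench binary) is to disassemble the binary and search for the offending instruction:
# count occurrences of stlur (armv8.4a build) vs. stlr (armv8a build)
$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-objdump -d build/bin/llama-bench | grep -c "stlur"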