NPU manager

Android 17 and higher supports the Neural Processing Unit (NPU) Manager (com.android.npumanager), which coordinates the allocation and scheduling of NPU resources across system services and application workloads. By moving resource arbitration from custom vendor daemons to the Android platform, the NPU Manager increases predictability, prevents resource starvation, manages thermal boundaries, and enhances overall device performance.

Background and motivation

Before the NPU Manager, apps and system modules submitted workloads directly to vendor drivers or proprietary services. This approach had several drawbacks:

  • Inefficient resource competition: Heavy machine learning workloads (such as Large Language Model (LLM) inference engines or on-device vision systems) competed directly with other high-priority systems for finite NPU resources (such as SRAM, weights memory, and execution channels).
  • System instability: Uncoordinated workloads could trigger thermal throttling, memory page faults, or low memory killer daemon (LMKD) kills if demand exceeded hardware capacity.
  • Inefficient prioritization: The system server couldn't adjust NPU priority in response to context shifts, such as a background task loading a massive model while a latency-sensitive camera pipeline or user assistant was active in the foreground.

The NPU Manager addresses these challenges by acting as a system-level arbiter that gates model loading and dynamically adjusts execution priorities based on current device health and app states.

System architecture

The NPU Manager is implemented as a system service named npu running within the Android framework. The NPU Manager isolates the high-level coordination of scheduling policies from the low-level vendor driver implementation.

The following diagram illustrates the NPU Manager environment layers:

Figure 1. NPU Manager environment layers.

Key components

  • Framework API Client (android.npumanager.NpuManager): The entry point used by clients to request model load reservations
  • System Service (npu): A system service that gates model load approvals and manages preemption commands based on scheduling priority rules
  • NPU Scheduling HAL (android.hardware.npu): An AIDL-based interface that relays Android app priorities and callbacks between the framework and the driver
  • Vendor driver: A low-level driver that controls the hardware execution blocks and implements low-level prioritization mechanisms

SDK and framework API

Before calling low-level neural network libraries or loading model files, framework clients must interact with the NpuManager service. To do this, clients first define a model load request and then execute the request and approval flow.

Model load request

A model load request is represented by ModelLoadRequest. This object contains:

  • Unique request ID
  • Estimated model size class, such as NPU_MODEL_SIZE_LESS_THAN_1GB or NPU_MODEL_SIZE_GREATER_THAN_2G
  • Intended priority, such as NPU_MODEL_PRIORITY_BACKGROUND, NPU_MODEL_PRIORITY_NORMAL, or NPU_MODEL_PRIORITY_OPPORTUNISTIC

The following code example builds a ModelLoadRequest with a size class of greater than 2 GB and normal execution priority:

ModelLoadRequest request = new ModelLoadRequest.Builder(requestId)
        .setSize(NPU_MODEL_SIZE_GREATER_THAN_2G)
        .setPriority(NPU_MODEL_PRIORITY_NORMAL)
        .build();

Request and approval flow

Clients invoke requestCanLoadModel asynchronously:

npuManager.requestCanLoadModel(request, callback, executor);

The framework responds through ModelLoadRequestCallback with the following events:

  • onCanLoadModel(request, status, listener): Fired when NPU resources are available and the request is approved. The client receives an NpuManager.ModelLoadStatusListener token. After the client fully loads the model into driver memory, it must call listener.notifyModelLoaded(request).
  • onRequestUnloadModel(request) or onRequestUnloadModel(request, reason): Fired when the system experiences resource pressure (such as an incoming foreground request or thermal spike) and requires the client to release its model. After reclaiming the NPU resources, the client calls listener.notifyModelUnloaded(request).
  • onModelLoadRequestComplete(request, status): Informs the client of final request lifecycle changes, such as cancellation.

Clients can cancel pending requests using cancelModelLoad(request).
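The handshake described above can be modeled as a small client-side state machine. The following is a minimal sketch in plain Java, not platform code: ModelLoadTracker and its State enum are hypothetical helpers, and the method names only mirror the callback events listed above. It also assumes that an unload request arrives only after the model has been reported as loaded, which this page doesn't confirm.

```java
// Hypothetical client-side helper; not part of the NpuManager API.
// It tracks where a single model load request is in the handshake.
final class ModelLoadTracker {
    enum State { REQUESTED, APPROVED, LOADED, UNLOAD_REQUESTED, UNLOADED, COMPLETE }

    private State state = State.REQUESTED;

    State state() {
        return state;
    }

    // Mirrors onCanLoadModel: the system approved the load.
    void onCanLoadModel() {
        transition(State.REQUESTED, State.APPROVED);
    }

    // Record this right before calling listener.notifyModelLoaded(request).
    void onModelLoaded() {
        transition(State.APPROVED, State.LOADED);
    }

    // Mirrors onRequestUnloadModel: the system wants the model released.
    // Assumption: this only happens once the model is loaded.
    void onRequestUnloadModel() {
        transition(State.LOADED, State.UNLOAD_REQUESTED);
    }

    // Record this right before calling listener.notifyModelUnloaded(request).
    void onModelUnloaded() {
        transition(State.UNLOAD_REQUESTED, State.UNLOADED);
    }

    // Mirrors onModelLoadRequestComplete: treated as terminal from any state.
    void onModelLoadRequestComplete() {
        state = State.COMPLETE;
    }

    private void transition(State expected, State next) {
        if (state != expected) {
            throw new IllegalStateException("Expected " + expected + " but was " + state);
        }
        state = next;
    }
}
```

A client could keep one tracker per request ID and reject out-of-order calls early, for example a notifyModelLoaded call issued before approval.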

HAL and vendor integration

To support the NPU Manager, device-specific vendor implementations must conform to the android.hardware.npu AIDL service interfaces.

Scheduling configuration

The system relays app priority using the SchedulingConfig AIDL structure defined in IScheduling.aidl:

package android.hardware.npu;

@VintfStability
parcelable SchedulingConfig {
    int minPriority;
    int maxPriority;
    int uid;
    int appPriority;
    boolean hasDirectAccess;
    boolean canAttributeOtherUid;
}

Using this structure, the NPU Manager coordinates priority alignments. For example, if a background app submits a high-priority job, the priority is adjusted downwards to prevent interference with foreground graphics.
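One way to picture this adjustment is as a clamp of the requested job priority into the range allowed for the app. The sketch below is illustrative only: PriorityAligner is a hypothetical class, the actual policy applied by the NPU Manager isn't specified on this page, and the code assumes that minPriority and maxPriority bound the allowed range and that a larger number means a higher priority.

```java
// Hypothetical illustration of priority alignment; not the platform policy.
final class PriorityAligner {
    private PriorityAligner() {}

    // Clamps a requested job priority into [minPriority, maxPriority].
    // Assumption: larger values mean higher priority.
    static int effectivePriority(int jobPriority, int minPriority, int maxPriority) {
        if (minPriority > maxPriority) {
            throw new IllegalArgumentException("minPriority must not exceed maxPriority");
        }
        return Math.max(minPriority, Math.min(maxPriority, jobPriority));
    }
}
```

Under these assumptions, a background app limited to the range 0 to 3 that submits a job at priority 9 would run at priority 3, while a job at priority 2 would be left unchanged.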

Task status and profiling

Vendor drivers must report the lifecycle status of NPU execution groups to the manager. The WorkInfo parcelable, defined in WorkInfo.aidl, tracks these tasks:

package android.hardware.npu;

import android.hardware.npu.NpuUuid;

@VintfStability
parcelable WorkInfo {
    int id;
    @nullable NpuUuid groupId;
    int uid;
    int debugPid;
    int originalUid;
    @nullable String debugFeatureId;
    int jobPriority;
    int effectivePriority;
    long timestampMs;
    int deviceNumber;
}

Event debouncing

The scheduling framework supports event debouncing using the debounce_duration_ms parameter within the scheduling callback registration. Debouncing suppresses rapid notifications, such as consecutive start and end events for a model that runs repeatedly, and avoids log flooding.

The callback lifecycle states are reported as follows:

  • onWorkRequested: Workload is enqueued by the vendor service.
  • onWorkStarted: Workload execution begins.
    • NPU_START_REASON_INITIAL: First execution run.
    • NPU_START_REASON_RESUMED: Execution resumed after preemption.
  • onWorkEnded: Workload execution ended.
    • NPU_END_REASON_COMPLETED: Successful run completion.
    • NPU_END_REASON_CANCELLED_USER: Cancelled by client.
    • NPU_END_REASON_CANCELLED_SYSTEM: Preempted by system policy.
    • NPU_END_REASON_FAILED: Execution error or driver failure.
    • NPU_END_REASON_PAUSED: Temporarily suspended for higher-priority tasks.
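To illustrate how debouncing can suppress bursts of the callbacks listed above, the following sketch implements a leading-edge debounce in plain Java: the first event for a key is delivered, and later events for the same key are dropped until that key has been quiet for the full window. EventDebouncer and the string keys are hypothetical; the exact semantics of debounce_duration_ms in the HAL aren't specified on this page, so treat this as one possible interpretation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical leading-edge debouncer; not the HAL implementation.
final class EventDebouncer {
    private final long debounceDurationMs;
    // Last time each key was seen, including events that were suppressed.
    private final Map<String, Long> lastSeenMs = new HashMap<>();

    EventDebouncer(long debounceDurationMs) {
        if (debounceDurationMs < 0) {
            throw new IllegalArgumentException("debounceDurationMs must be non-negative");
        }
        this.debounceDurationMs = debounceDurationMs;
    }

    // Returns true if the event should be delivered. An event is delivered
    // only when its key has been quiet for at least the debounce window.
    boolean shouldDeliver(String key, long nowMs) {
        Long previous = lastSeenMs.put(key, nowMs);
        return previous == null || nowMs - previous >= debounceDurationMs;
    }
}
```

A caller might build the key from the app UID and the event type (for example, "10123:onWorkStarted"), so that a model executing in a tight loop produces one notification per burst instead of one per run.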

Device readiness and testing

Ensure these configurations are in place before verifying device health.

Application declarations

Clients seeking NPU scheduling prioritization must declare the NPU hardware feature in their AndroidManifest.xml:

<uses-feature android:name="android.hardware.npu" android:required="false" />

For models deployed on newer generations of partner hardware, this declaration might be required for optimal engine creation.

VTS integration testing

NPU HAL implementations can be validated with VTS functional tests, for example, VtsHalNpuSchedulingTargetTest.