Solvers

SmartFace Embedded solvers are binaries that run NN model inference together with the model-related pre-processing and post-processing operations. There is a specific solver for each NN model.

Our NN models can be accelerated on a variety of HW, including CPUs, GPUs and NPUs. Depending on the target platform and inference engine, we distinguish different solvers.

Currently, SmartFace Embedded supports NN acceleration using the inference engines described in the sections below.

In SFE Toolkit, use the function sfeSolverCreate to load the correct solver.

Example

std::string solver_face_detect = SOLVER_FACE_DETECT;
SFESolver detector_solver{};
// Load detection solver
SFEError error =
    sfeSolverCreate(solver_face_detect.c_str(), &detector_solver);
utils::checkError(error);

To configure solver-specific parameters, use the function sfeSolverCreateWithParameters instead.

In SFE Stream Processor, you set the paths to the solvers to be loaded and used via the solvers section of settings.yaml.


ONNX Runtime

The ONNX Runtime inference engine enables inference of NN models on a variety of HW using different so-called execution providers. The ONNX Runtime solver comes with the suffix onnxrt.solver in its name.

ONNX Runtime solvers are currently supported for Windows x86, Linux x86, Jetson ARM64 and Android architectures.

SFE Toolkit currently uses the following ONNX Runtime versions:

  • 1.13.1 - Linux, Windows and NVidia Jetson with Jetpack 5.0 and higher
  • 1.12.0 - NVidia Jetson with Jetpack 4.6
  • 1.7.2 - NVidia Jetson with Jetpack 4.4

The ONNX Runtime execution provider can be configured using the solver parameter “runtime_provider” or the environment variable ONNXRUNTIME_SOLVER_RUNTIME_PROVIDER, as shown in the sketch after this list. Supported runtime/execution providers are:

  • “cpu” - Linux, Windows, Jetson, Android
  • “cuda” - Linux, Windows, Jetson
  • “tensorrt” - Linux, Windows, Jetson

Note: The environment variable overrides the solver parameter.
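
For example, the execution provider can be selected from code by setting the environment variable before the solver is loaded. This is a minimal sketch: the solver path is illustrative, and setenv is POSIX (on Windows use _putenv_s instead).

// Select the CUDA execution provider before loading the ONNX Runtime solver.
// The environment variable overrides the "runtime_provider" solver parameter.
setenv("ONNXRUNTIME_SOLVER_RUNTIME_PROVIDER", "cuda", 1);

std::string solver_face_detect = "face_detect.onnxrt.solver"; // illustrative path
SFESolver detector_solver{};
SFEError error =
    sfeSolverCreate(solver_face_detect.c_str(), &detector_solver);
utils::checkError(error);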

ONNX Runtime solvers depend on the onnxruntime library. On Windows, the Microsoft C and C++ (MSVC) runtime libraries must be installed; see this page.

CPU

The default ONNX Runtime CPU Execution Provider is MLAS.

The performance and CPU utilization can be configured using the following solver parameters and environment variables (see the sketch after this list):

  • solver parameter “inter_threads” or env variable ONNXRUNTIME_SOLVER_INTER_THREADS
    • Sets the number of threads used to parallelize the execution of the graph (across nodes).
    • The default value is 0, which means the default number of threads will be used.
    • If “parallel” execution mode is turned on, this sets the maximum number of threads used to run the nodes in parallel.
    • If “sequential” execution mode is enabled, this value is ignored; it acts as if it was set to 1.
  • solver parameter “intra_threads” or env variable ONNXRUNTIME_SOLVER_INTRA_THREADS
    • Sets the number of threads used to parallelize the execution within nodes.
    • The default value is 0, which means the default number of threads will be used.
  • solver parameter “execution_mode” or env variable ONNXRUNTIME_SOLVER_EXECUTION_MODE
    • Controls whether the operators are executed in parallel or sequentially.
    • “parallel” - execute operators in the graph in parallel.
    • “sequential” - execute operators in the graph sequentially.
    • The default value is “parallel”.

Note: The environment variable overrides the solver parameter.
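
As a sketch, the threading options follow the same pattern; the values below are illustrative and assume you want deterministic, sequential execution with a bounded number of threads per node (setenv is POSIX, on Windows use _putenv_s).

// Execute graph operators sequentially and use 4 threads within each node.
setenv("ONNXRUNTIME_SOLVER_EXECUTION_MODE", "sequential", 1);
setenv("ONNXRUNTIME_SOLVER_INTRA_THREADS", "4", 1);
// "inter_threads" is ignored in sequential mode, so it is not set here.
// Load the solver afterwards with sfeSolverCreate as in the Example above.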

CUDA

The CUDA Execution Provider enables hardware-accelerated computation on NVidia CUDA-enabled GPUs and NVidia Jetson platforms.

The supported CUDA and cuDNN version requirements are documented here.

You can specify the ID of a CUDA device where the NN model inference will be executed by setting a solver parameter “device_id” or env variable ONNXRUNTIME_SOLVER_DEVICE_ID:

  • The default value is 0.

Note: The environment variable overrides the solver parameter.
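
For instance, to run the inference on the second CUDA device (a minimal sketch with an illustrative device ID):

// Use the CUDA execution provider on GPU 1 instead of the default GPU 0.
setenv("ONNXRUNTIME_SOLVER_RUNTIME_PROVIDER", "cuda", 1);
setenv("ONNXRUNTIME_SOLVER_DEVICE_ID", "1", 1);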

TensorRT

With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration.

The TensorRT execution provider in the ONNX Runtime makes use of NVIDIA’s TensorRT Deep Learning inferencing engine to accelerate the ONNX model in their family of GPUs.

The supported TensorRT and CUDA versions are documented here.

You can configure TensorRT settings by environment variables. Find more details here.

To decrease the ONNX model load time, you can enable the TensorRT engine caching with the following environment variables:

  • ORT_TENSORRT_ENGINE_CACHE_ENABLE: Enable TensorRT engine caching. The default value is 0 (disabled), value 1 means caching is enabled.
  • ORT_TENSORRT_CACHE_PATH: Specify the path for the TensorRT engine and profile files if ORT_TENSORRT_ENGINE_CACHE_ENABLE is 1.
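
For example, engine caching can be enabled from code before the solver is loaded (a minimal sketch; the cache directory is illustrative):

// Cache built TensorRT engines so subsequent runs skip the costly engine build.
setenv("ORT_TENSORRT_ENGINE_CACHE_ENABLE", "1", 1);
setenv("ORT_TENSORRT_CACHE_PATH", "/var/cache/sfe/tensorrt", 1); // illustrative path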

Rockchip NPU

Rockchip NPU is used for NN model acceleration on Rockchip RV1109 and RV1126.

Rockchip solver comes with the suffix rockchip.solver in its name.

Ambarella CVFlow

Ambarella CVFlow is used for NN model acceleration on Ambarella CV25, CV22 and CV2 chips.

Ambarella solver comes with the suffix ambarella.solver in its name.

SmartFace Embedded currently supports the following versions of Ambarella SDK:

  • Ambarella SDK 3.0 - dependency on EazyAI library (libeazyai.so)
  • Ambarella SDK 2.5.8 - dependency on nnctrl (v0.3.0) and cavalry_mem (v0.0.6) libraries

Please note that different packages with different solvers are required for various combinations of CV architecture and Ambarella SDK version.

HailoRT

HailoRT is used for NN model acceleration on the Axiomtek RSC101 AI box, as well as on other edge devices using the Hailo-8 chip.

Hailo solver comes with the suffix hailo.solver in its name.

Follow the instructions to install HailoRT.


TensorFlow Lite

The TensorFlow Lite inference engine is used for NN model acceleration on the NPU of NXP’s i.MX 8 series hardware.

TensorFlow Lite solver comes with the suffix tflite.solver in its name.

Acceleration on the NPU is handled by the VX delegate plugin. The VX delegate enables accelerating inference on the on-chip hardware accelerator of the i.MX 8 series. It directly uses the hardware accelerator driver (OpenVX with extension) to fully utilize the accelerator capabilities.

SmartFace Embedded currently uses TensorFlow Lite 2.9.1.

Supported TFLite solver parameters

  • tflite_solver.delegate
  • tflite_solver.num_threads
  • tflite_solver.vx_delegate.device_id
  • tflite_solver.vx_delegate.cache
  • tflite_solver.vx_delegate.cache_file_path

Supported TFLite solver environment variables

  • TFLITE_SOLVER_DELEGATE
  • TFLITE_SOLVER_NUM_THREADS
  • TFLITE_SOLVER_VX_DELEGATE_DEVICE_ID
  • TFLITE_SOLVER_VX_DELEGATE_ENABLE_CACHING
  • TFLITE_SOLVER_VX_DELEGATE_CACHE_FILE_PATH

Note: The environment variable overrides the solver parameter.

Delegate parameter

Enables users to specify a delegate that should be used for inference. For example, the GPU delegate or the VX delegate can be specified. Possible values of this parameter, whether it is set via the generic solver API or via an environment variable, are:

  • “cpu”
  • “gpu”
  • “nnapi”
  • “vx”

The default value is “cpu”.
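
A minimal sketch of selecting the VX delegate through the environment variable, which overrides the tflite_solver.delegate parameter (setenv is POSIX):

// Run the TFLite solver on the i.MX 8 NPU through the VX delegate.
setenv("TFLITE_SOLVER_DELEGATE", "vx", 1);
// Load the solver afterwards with sfeSolverCreate as in the Example above.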

Num Threads parameter

Allows users to specify the number of threads to be used for model inference. Currently, this parameter only affects the CPU delegate and does nothing if specified along with a different delegate. Possible values are integers from -1 upwards. Special behavior is triggered when the following values are provided:

  • -1 - let the tflite engine choose the most suitable number of threads.
  • 0 - Multithreading disabled.

The default value is -1, which means that the TensorFlow Lite engine will decide how many threads it’s going to use.

Vx Delegate device ID

In an environment with multiple VX NPUs, this parameter allows you to specify the ID of the device that the solver should use for inference.

The default value is 0.

Vx Delegate caching enabled and cache file path

Enables you to turn on VX model caching. This is useful when the first VX model inference takes too long: the VX delegate can cache its compiled models into files and reuse them in another instance of the process. You can turn this feature on by setting tflite_solver.vx_delegate.cache to “true”. By default, the model is cached into the file /tmp/tflite_solver.vxcache.

Default behavior: caching is disabled.
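
As an illustration, caching can also be enabled through the environment variables (a sketch assuming the variable accepts the same “true” value as the parameter; the cache file path is illustrative):

// Cache the compiled VX model so later runs skip the first-inference compilation.
setenv("TFLITE_SOLVER_VX_DELEGATE_ENABLE_CACHING", "true", 1);
setenv("TFLITE_SOLVER_VX_DELEGATE_CACHE_FILE_PATH", "/data/sfe/model.vxcache", 1); // illustrative path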


SmartFace Embedded Input and Output Solvers

SmartFace Embedded also supports other solvers for processing various inputs, such as camera input, image input and GStreamer input, and for displaying the video output in a terminal or a window.

These solvers are not available on all supported platforms.

Camera input solver

This solver opens a camera device with a given index and outputs the captured image as an output tensor.

Parameters

  • camera_index: u32 (index of linux camera device) Default: 0
  • camera_width: u32 (width of the camera resolution) Default: 640
  • camera_height: u32 (height of the camera resolution) Default: 480
  • camera_fps: u32 (framerate of the camera) Default: 5
  • camera_format: string (frame format of the camera) Default: YUYV
    • “MJPEG”
    • “YUYV”
    • “GRAY”
    • “RAWRGB”
    • “NV12”

You can check the available cameras using the command gst-device-monitor-1.0.


GStreamer input solver

This solver runs a GStreamer pipeline, consumes its output and creates a tensor.

The default gst pipeline is: v4l2src device={gst_video_device} ! video/x-raw,width={gst_width},height={gst_height},framerate=10/1 ! videoconvert ! video/x-raw,format=BGR ! appsink name={app_sink} drop=True max-buffers=1 emit-signals=True async=false sync=false

Parameters

  • gst_pipeline: string (complete gst pipeline that overrides every other parameter)
  • gst_width: u32 (width of the camera resolution)
  • gst_height: u32 (height of the camera resolution)
  • gst_app_sink_name: string (app sink name)
  • gst_video_device: string (linux video device e.g. /dev/video1)

A GStreamer pipeline can be used to obtain BGR frames from a USB camera, a video file or an RTSP stream.

Rules to follow when providing a custom Gstreamer pipeline:

  • Don’t specify any *sink element in the pipeline, e.g. appsink, autovideosink, …
  • Always make sure the pipeline outputs a video/x-raw,format=BGR frame
  • Check if width/height properties are correctly set
  • Check if your detector model accepts the same input shape as the pipeline provides

See also GStreamer pipelines for more information on how to configure the pipeline properly.
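
For illustration, a custom pipeline for an RTSP stream that follows the rules above could look like this (the URL, resolution and H.264 decoder element are placeholders that depend on your camera and installed GStreamer plugins):

rtspsrc location=rtsp://<camera-url> ! rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! videoscale ! video/x-raw,format=BGR,width=640,height=480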

Image input solver

It simulates a camera stream by reading images from a given folder. The default folder is set as:

const DEFAULT_PATH: &str = ".";

The images are sorted by name before streaming.

The images are streamed into the Window output solver (the default option in in_out_example).

Parameters

  • path: string (path to override the default path for images)

Terminal output solver

Displays a video stream in a terminal window. You can set various parameters of the video output. The default parameters are:

const DEFAULT_TRANSPARENT: bool = false;
const DEFAULT_ABSOLUTE_OFFSET: bool = true;
const DEFAULT_X: u16 = 0;
const DEFAULT_Y: i16 = 0;
const DEFAULT_RESTORE_CURSOR: bool = true;
const DEFAULT_WIDTH: Option<u32> = None;
const DEFAULT_HEIGHT: Option<u32> = None;
const DEFAULT_TRUECOLOR: bool = true;
const DEFAULT_USE_KITTY: bool = true;
const DEFAULT_USE_ITERM: bool = true;

Parameters

  • transparent: bool (Enable true transparency instead of checkerboard background.)
  • absolute_offset: bool (Make the x and y offset be relative to the top left terminal corner. If false, the y offset is relative to the cursor’s position)
  • x: u16 (X offset)
  • y: i16 (Y offset. Can be negative only when absolute_offset is false)
  • restore_cursor: bool (Take a note of cursor position before printing and restore it when finished)
  • width: u32 (Image width)
  • height: u32 (Image height)
  • truecolor: bool (Use truecolor if the terminal supports it)
  • use_kitty: bool (Use Kitty protocol if the terminal supports it)
  • use_iterm: bool (Use iTerm protocol if the terminal supports it)

Window output solver

Outputs a video stream to a simple window.

Parameters

  • title: string (Window title)
  • width: usize (Window width)
  • height: usize (Window height)