There is a shift taking place in the physical architecture in which AI software runs. AI is moving toward the edge.
Device-centric edge computing has become common in use cases such as medical devices, supply chain equipment, shop-floor manufacturing and the Internet of Things (IoT). Until very recently, however, the software running on those devices used traditional, structured programming: decision-making was constrained by static logic expressed in the conditional statements of whatever programming language was in use.
Embedding AI in an edge device takes its capabilities to the next level. Cameras with embedded AI can make decisions based solely on the image being viewed, for example, distinguishing between an adult and a child, or a truck and a motorcycle, and then acting accordingly. Before AI was enabled at the edge, distinguishing objects in a complex environment required enormous amounts of time-consuming back-and-forth over the network. Once an edge device is AI-enabled, that processing happens locally and responses are nearly instantaneous.
AI at the edge is a game-changer, but it is not a technical panacea. There’s a lot to consider when implementing it. Let’s look at the architectural basics of running AI in general, along with the particulars of running AI at the edge. We’ll go over the fundamental distinction between using AI at the near edge and far edge. Finally, we’ll look at the benefits that AI-enabled edge computing provides, as well as the tradeoffs.
Let’s start with the architectural basics.
Architectural Basics of AI at the Edge
Figure 1 illustrates the basic architecture of an application that uses a large language model (LLM). The application's control flow breaks down into the following steps. In the case of an application exposed as an API, a web server publishes a set of HTTP endpoints to which a user submits natural language prompts (1).
A client application takes prompts directly from the user or as input forwarded by the API web server (2). The client application, in turn, interacts with an intermediate library such as llama.cpp, TensorFlow or PyTorch (3). The intermediate library abstracts hardware complexity and provides a programming interface to the given LLM. It does the work of loading the model into memory and converting natural language prompts into tokens.
In the case of natural language processing, tokenization is the process of converting sentences into small language units, such as individual characters, words and word groups, punctuation marks, and special symbols. In the case of images and video, tokenization breaks the image or video frame into regions of pixels that are called patches.
Once a prompt has been tokenized, it is submitted to the given LLM, which works with software frameworks that interact with the CPU or GPU to determine a valid response. (4)
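To make these steps concrete, the following is a minimal sketch of the client-application side of the workflow, assuming the llama-cpp-python bindings for llama.cpp are installed and a quantized GGUF model file is available locally. The model path is a placeholder.

# Minimal sketch: load a model, tokenize a prompt and generate a response
# using the llama-cpp-python bindings (an assumption, not a requirement).
from llama_cpp import Llama

# Step 3: the intermediate library loads the model into memory.
llm = Llama(model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf")  # placeholder path

# Tokenization: the library converts the natural language prompt into tokens.
prompt = "Is the vehicle described below a truck or a motorcycle?"
tokens = llm.tokenize(prompt.encode("utf-8"))
print(f"The prompt becomes {len(tokens)} tokens")

# Step 4: the tokenized prompt is submitted to the LLM to produce a response.
result = llm(prompt, max_tokens=64)
print(result["choices"][0]["text"])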
Typically, processing a prompt on an LLM is a resource-intensive undertaking that can slow down responsiveness. Therefore, an accelerator can be used to increase processing performance. (4a) An accelerator improves performance by exploiting the parallel processing power of one or more GPUs, improving memory management, and applying quantization techniques that reduce model size and increase inference speed. Using an accelerator is optional but recommended for production-level operations.
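Because accelerator support is usually exposed through the intermediate library, the same sketch can be extended with accelerator-related settings. The configuration below assumes a GPU-enabled build of llama-cpp-python and an already-quantized model; the values are illustrative, not a recommendation.

# A sketch of step 4a: a quantized model plus GPU offloading, assuming a
# GPU-enabled build of llama-cpp-python (both are assumptions).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # 4-bit quantized model: smaller, faster
    n_gpu_layers=-1,  # offload every layer to the GPU when one is available
    n_threads=8,      # CPU threads for any work that stays on the CPU
    n_ctx=2048,       # context window; larger values consume more memory
)

response = llm("Summarize the day's entry log in one sentence.", max_tokens=64)
print(response["choices"][0]["text"])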
Figure 2 below shows the architecture for an online AI service. The AI installation runs on racks of servers in a data center accessed via the Internet. Users submit a prompt from their desktop computer to an internet provider, which forwards the prompt to a particular API endpoint in the data center, based on a targeted URL that represents the service.
AI processing is conducted according to the Client App -> Intermediate Library -> Accelerator -> LLM workflow described previously in Figure 1. Using AI services incurs the burden of network latency. Also, at the enterprise level, AI services charge customers for the service.
The benefit of using an AI service is that customers do not need to support the overhead that goes with working with LLMs. However, it is entirely possible for users to emulate the standard AI architecture on their local machine. All that’s required is a machine with enough memory, storage and computing capacity to support the AI architecture. Also, users need to install an appropriate client app, an intermediate library, an LLM, and an optional accelerator. (See Figure 3.)
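As a hedged illustration, one way to emulate the service architecture of Figure 2 on a local machine is to run the OpenAI-compatible server that ships with llama-cpp-python and point a standard client at it. The model path, port and model name below are placeholders.

# Emulating the AI-service architecture locally, assuming llama-cpp-python was
# installed with its server extra. First start the local server, for example:
#   python -m llama_cpp.server --model ./models/tinyllama-1.1b-chat.Q4_K_M.gguf
from openai import OpenAI

# Point the standard client at the local endpoint instead of a cloud service.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="local-model",  # placeholder name; the local server serves its loaded model
    messages=[{"role": "user", "content": "What can an AI-enabled camera detect?"}],
)
print(reply.choices[0].message.content)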
As shown in Figure 4 below, users can also run AI on cell phones, tablets or home assistants such as Amazon's Alexa, provided the device has the computing power necessary to run a particular model. LLMs such as Gemma 2B, TinyLlama, and Helium-1 can be run on mobile devices, usually on those with 8 GB of RAM or more.
Typically, these models are designed to do most of their work on the device on which they are installed. However, there are situations where an application might take a hybrid approach to AI processing, one in which both local and external services, such as another model or other type of computing services, are used.
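One possible shape for such a hybrid approach is sketched below: the device answers locally when it can and defers to an external service when it cannot. The fallback URL and response format are hypothetical stand-ins for whatever remote service an application actually uses.

# A hybrid local/remote pattern. The remote endpoint is hypothetical.
import requests
from llama_cpp import Llama

local_llm = Llama(model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf")  # placeholder path

def answer(prompt: str) -> str:
    try:
        # Try local inference first to avoid network latency and per-call costs.
        result = local_llm(prompt, max_tokens=64)
        return result["choices"][0]["text"]
    except Exception:
        # Fall back to an external service (hypothetical URL and payload shape).
        resp = requests.post("https://ai.example.com/v1/generate",
                             json={"prompt": prompt}, timeout=30)
        resp.raise_for_status()
        return resp.json()["text"]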
LLMs can also be installed on dedicated devices, such as a computer-enabled camera or, as mentioned above, a digital assistant. Figure 5 below shows an AI architecture in which a video camera is connected to a computing device such as a Raspberry Pi. The Raspberry Pi has an application that captures images streaming from the camera and passes the stream onto the LLM via an intermediate library for processing.
Once a device processes data using AI, it uses the responses from the LLM locally on the device itself, or the responses can be shared with other devices on the local area network to which the camera apparatus is connected. Also, those responses can be forwarded to external services in the cloud to satisfy other uses.
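A simplified sketch of this kind of pipeline follows, assuming OpenCV is installed on the Raspberry Pi. The describe_frame function and the near-edge endpoint are hypothetical placeholders for the actual model integration and network layout.

# Capture frames from a camera, run local inference and optionally share the
# result on the LAN. describe_frame() is a hypothetical stand-in for the
# vision-capable model and intermediate library actually used.
import cv2
import requests

def describe_frame(frame) -> str:
    # Placeholder: pass the frame to a locally hosted multimodal model here.
    return "a black SUV entered the garage"

capture = cv2.VideoCapture(0)  # camera attached to the Raspberry Pi
try:
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        event = describe_frame(frame)  # local inference on the device
        # Optionally forward the result to a near-edge server on the LAN
        # (hypothetical address and route).
        requests.post("http://192.168.1.50:8080/events",
                      json={"event": event}, timeout=5)
finally:
    capture.release()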
So far, we've described three basic architectures for working with LLMs. The first is the architecture for implementing LLMs as AI services accessed remotely over the internet. The second is for using an LLM hosted on a local machine or mobile device. The third is for using LLMs on dedicated digital devices, such as a video camera or AI assistant, operating within a local area network.
The last two architectures are typical of what we can think of as edge computing. However, when it comes to actually implementing AI at the edge, there is an added dimension to consider: the distinction between using AI at the near edge and at the far edge.
Running at the Near Edge vs. Far Edge
Figure 6 below illustrates a near-edge and far-edge network topology. Typically, the near edge is made up of on-premises or regional servers. Such locations can have their own content delivery network (CDN) caches, security frameworks, and performance-enhancing mechanisms.
Near-edge installations have a limited number of distribution locations. Examples of near-edge facilities are a cell tower that services a large number of mobile devices, or a central data center in a factory that controls its various internal digital devices and machines.
The far edge, by contrast, tends to consist of devices installed at locations remote from a central computing facility such as a cell tower or a rack of servers hosted internally on-premises. The devices at the far edge are typically dedicated to a specific purpose, such as video surveillance of a particular area or controlling some type of machinery, such as an automated workstation on an assembly line. These devices have just enough computing capacity to support their dedicated activities.
Thus, when using AI and LLMs at the near and far edges, one needs to consider two factors: purpose and computing capacity.
As mentioned previously, devices used at the far edge tend to have a specific purpose and a computing capacity appropriate to that purpose, for example, keeping track of automobiles entering and exiting a video-surveilled garage.
This purpose can be achieved by connecting a video camera to a Raspberry Pi running a client application that uses a model with a small footprint, such as TinyLlama. The video camera/Raspberry Pi installation can detect and describe an event, such as “a black SUV entered the garage at 14:35, traveling at 10 mph.”
This event data can then be sent to servers on the near edge that are intended to determine entry and exit behavior among all vehicles at all garages under surveillance. Also, the video data collected daily by the far-edge device can be transmitted to a central data center equipped with ample storage, thus reducing the likelihood of storage overload on the far-edge device.
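One plausible way to forward such an event from the far edge to the near edge is a lightweight publish/subscribe message, sketched here with the paho-mqtt client library. The broker address, topic name and event fields are assumptions made for the sake of the example.

# Publish a far-edge event to a broker running at the near edge.
# The hostname, topic and payload schema are hypothetical.
import json
from paho.mqtt import publish

event = {
    "vehicle": "black SUV",
    "direction": "entered",
    "time": "14:35",
    "speed_mph": 10,
    "camera_id": "garage-03",
}

publish.single("garage/events", payload=json.dumps(event),
               hostname="near-edge.local", port=1883, qos=1)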
In short, when dividing resources between near-edge and far-edge devices, the far-edge device should have only as much computing overhead as its capacity can accommodate and is necessary for it to achieve its purpose. Behavior that is extraneous to the purpose of the far-edge device should be implemented at the near edge, if possible — and, if not, at an external resource hosted on the cloud.
The Benefits of Running LLMs at the Edge
Running LLMs at both the near and far edges has several benefits. The most prominent are as follows.
Greater Data Privacy
Running an LLM locally on the edge keeps an organization's data and AI activity private. Using a public AI service carries the risk of prompt injection vulnerabilities, in which the service can be manipulated into exposing an organization's sensitive data to malicious parties. In addition, both the natural language prompts and the associated responses transit infrastructure outside the organization's control. Running an LLM at the near or far edge keeps this data in-house.
Faster Processing
An AI application running on an edge machine where the model is stored locally will be faster simply because of proximity: prompts do not have to travel over the network to get a response. Everything that's needed is local, and performance is constrained only by the physical capabilities of the local hardware.
Lower Costs
Most enterprise-grade AI services charge a fee for software and hardware usage. For example, using the Google Cloud Video Intelligence API to do face detection costs 10 U.S. cents per minute, with the first 1,000 minutes free. A week of 24/7 surveillance is 10,080 minutes; subtract the 1,000 free minutes and the remaining 9,080 minutes at $0.10 each come to $908 per video stream.
Of course, volume discounts are usually applied for high-volume usage. Still, at even $500 a week per video stream, these costs add up in no time. Thus, running onboard video recognition on a per-camera basis, at either the far or near edge, incurs significant savings. Similar savings can be achieved for audio recognition and various other AI applications that require continuous use.
However, although using AI at the edge has a number of benefits, there are also tradeoffs.
Understanding the Tradeoffs
Running AI at the near or far edge involves a number of tradeoffs. The following describes some of the more prominent ones.
Significant Hardware Requirements
Running an LLM is a resource hog. A minimal installation that uses an Intel i5/i7 CPU with 16 GB of RAM and 10 GB of disk storage is adequate for very rough image recognition, for example, spotting a particular person in a group photo or processing a barcode scan against a price database stored on the local device.
However, doing tasks such as creating a high-quality photo or video based on a set of user requirements or processing video streams for complex outcomes requires a lot more computing power. For example, creating a five-minute, AI-generated music video with predefined cartoon characters would require an NVIDIA RTX 5090 (32GB VRAM), 64 GB of RAM, 2 TB NVMe SSD, and an Intel i9 or AMD Threadripper.
The cost of such a setup can run between roughly $3,600 and $7,500. Of course, a music video is not an appropriate use case for running AI on the edge. But it does give you a sense of the range of costs involved.
The important thing to understand about running AI at the edge is that hardware requirements are dependent on the purpose of the edge device. Typically, running AI on the edge is not intended for generic use. Hardware specificity will also become more pronounced as AI becomes more prevalent as a dedicated feature in embedded devices such as automobile sensors and factory floor machines.
In short, when it comes to running AI on the edge, hardware considerations are significant. You need to find exactly the right hardware configuration to meet the need at hand. There are few one-size-fits-all solutions.
More Complex Programming
Writing programs geared to running AI at the edge involves a level of complexity that goes beyond writing traditional business applications that run from a data center and are accessed over the internet.
Typically, edge devices are dedicated to particular sensors and audio-visual apparatus. Developers therefore need to handle not only standard I/O concerns, such as reading and writing data to disk and coping with the dynamics of network throughput, but also the particulars of the onboard hardware itself: capturing video efficiently, for example, or controlling a mechanism in an automated assembly line or home appliance. It's a specialized programming skill that is rarely found among developers who write business applications.
In addition, developers writing code for LLMs that run on the edge need to be aware of model optimization to ensure their code runs efficiently. Model optimization involves techniques such as hyperparameter tuning, which is about finding the best values for a model's configuration settings, such as learning rate, batch size and context length, as distinct from the billions of weights the model learns on its own.
Another aspect of model optimization is quantization. Developers use quantization to reduce the precision of a model's numeric weights, lowering the memory and computational burden placed on the device and tuning overall performance to meet the need at hand.
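As a generic illustration of the principle, PyTorch's dynamic quantization converts a model's linear-layer weights from 32-bit floats to 8-bit integers. Production LLM deployments more often quantize offline into formats such as 4-bit GGUF files, but the size-versus-precision tradeoff is the same.

# Dynamic quantization in PyTorch: trade numeric precision for a smaller,
# faster model. The toy model below is illustrative only.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Convert the linear layers' fp32 weights to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    torch.save(m.state_dict(), "/tmp/model.pt")
    return os.path.getsize("/tmp/model.pt") / 1e6

print(f"fp32: {size_mb(model):.1f} MB  int8: {size_mb(quantized):.1f} MB")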
Chip-Specific Programming
As mentioned several times in this article, running AI at the edge is chipset-specific; this is particularly true when using an accelerator. For example, to increase performance on an edge device built around an Arm CPU, developers can use Arm's NEON SIMD extensions, while on an x86 device they can work with hardware acceleration extensions such as AVX or AMX.
In addition, GPUs have their own acceleration frameworks, such as CUDA for NVIDIA GPUs or ROCm, AMD's open-source compute platform for its GPUs.
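A simplified, Linux-only sketch of the kind of chip detection involved appears below. Real projects typically rely on their build systems or vendor tooling for this, and the backend labels are purely illustrative.

# Inspect the CPU architecture and feature flags to decide which acceleration
# path to configure. Linux-only; backend names are illustrative placeholders.
import platform
from pathlib import Path

def pick_backend() -> str:
    machine = platform.machine().lower()
    info = Path("/proc/cpuinfo")
    cpuinfo = info.read_text().lower() if info.exists() else ""
    if machine in ("aarch64", "arm64") and ("asimd" in cpuinfo or "neon" in cpuinfo):
        return "arm-neon"     # Arm SIMD (NEON/ASIMD) available
    if "amx" in cpuinfo:
        return "x86-amx"      # Intel Advanced Matrix Extensions
    if "avx512" in cpuinfo or "avx2" in cpuinfo:
        return "x86-avx"      # AVX vector extensions
    return "generic"          # fall back to unaccelerated code paths

print(f"Configuring inference backend: {pick_backend()}")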
The takeaway is that companies that intend to write software for running AI on the edge will need to be very aware of the specific chipsets in play. Unlike typical business applications that tend to be limited to targeting 64-bit AMD/Intel chips or variations of Arm chipsets for mobile devices, running AI on the edge can encounter any number of CPU/GPU combinations that need to be supported.
Companies will need to have a solid grounding in the physical aspects of edge computing and the expertise to code and configure software environments to those specific chipsets. It’s a more detailed, less generic undertaking.
Greater Burden on Deployment Maintenance
Software development at the enterprise level is a continuous cycle of refactoring and redeployment. This is as true for business applications running on the web as it is for AI running on the edge. Code is always changing and always needs to be deployed; with AI at the edge, however, the deployment burden can be greater.
According to Michel Burger, CTO of mimik, a company heavily engaged in development for AI and the edge, “One of the deep secrets of AI at the edge is that you have to be extremely good at device management.”
Consider the following: GPT-Neo 2.7B, an open-source model from EleutherAI, requires about 10.7 GB of storage. At 500 Mbps, a typical mid-range download speed for a business account from an internet service provider such as AT&T, the theoretical transfer time for a file that size is roughly three minutes, and real-world throughput and server-side limits can easily stretch a download to 10 minutes or more.
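The back-of-the-envelope arithmetic is straightforward:

# Ideal transfer time for a model download: gigabytes times 8 gives gigabits,
# divided by the link speed. Real downloads are usually slower than this.
model_size_gb = 10.7      # approximate on-disk size of GPT-Neo 2.7B
link_speed_mbps = 500     # nominal business-class download speed

seconds = (model_size_gb * 8 * 1000) / link_speed_mbps
print(f"Ideal transfer time: {seconds / 60:.1f} minutes")  # roughly 2.9 minutes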
Now, imagine you have 50 edge devices distributed throughout a factory floor running that model. As history can attest, there will come a time when those devices need to be upgraded to a new version of the model. Even if the new model is pushed to every device simultaneously, the upgrade will take at least as long as a single download, and sharing the factory's bandwidth across 50 devices can stretch it considerably. If the factory runs on a 24-hour basis, it will need to be inoperative for at least the duration of the download, if not longer. This can have a consequential impact.
Granted, running a model such as GPT-Neo 2.7B, which is intended for generic use, is probably overkill in an AI on the edge scenario. A smaller model that can support the specific purpose of the edge devices will be more appropriate to the need at hand, and download times can be shorter.
Nonetheless, no matter the size of the model, downloading upgrades will be part of the maintenance cycle, and these upgrades will take time that needs to be anticipated.
Also, there will be times when the application software and hardware need to be upgraded as well. It’s not as simple as upgrading a desktop or mobile application or even upgrading an enterprise-scale web application running in a data center under Kubernetes.
The infrastructure for upgrading desktop and mobile applications has been around for a while, as has the mechanism for upgrading applications running under Kubernetes. A standard for deploying AI on the edge is still in its embryonic stage. Until the techniques mature, companies will have to come up with their own deployment methodologies for AI on the edge that are reliable and accurate.
Putting It All Together
Running AI on the edge is the next step in the evolution of making artificial intelligence mainstream. Moving LLMs beyond the data center and onto devices running on the near and far edge adds a new dimension to distributed application architecture.
Running AI on the edge can reduce costs and improve performance, but it does come with tradeoffs. The hardware that runs on the edge is more constrained. Thus, special consideration must be given to how resources are used. Also, given that many smaller devices are distributed among any number of places in a given location, a company needs to have considerable expertise in device maintenance, both in terms of software and hardware. Application and model deployment have yet to be standardized.
Nonetheless, the opportunities for running AI on the edge are significant. The possibilities for productive use of AI technology at the edge are limited only by the ability of companies and the software development community to imagine practical use cases. The hardware is readily available. The software libraries needed to support innovation are but a download away. All that’s left is to connect the dots and use AI at the edge to make products that will make a better life for all of us.