WHITEPAPER

Twilio Microvisor—Architecture and Design Considerations for Modern IoT Infrastructure

Authors: Joe Birr-Pixton, Toby Duckworth, Hugo Fiennes, Peter Hartley, Phil Michaelson-Yeates

What will you learn in this Whitepaper?

The Internet of Things is at a critical juncture. Repeating past mistakes, specifically those related to security and maintenance, will slowly but surely erode confidence in IoT and in its ability to help address the many problems that both businesses and the environment face. It’s time to change the way we work on embedded systems.

 

Based on a decade of hands-on experience with the development and maintenance of real-world IoT solutions, this whitepaper explores the unique issues and challenges that connecting devices to the Internet brings.

The paper is organized into two parts. 

 

Part I addresses a broader audience, such as IoT product/project managers and CTOs. It lays out a typical device-side IoT architecture and describes the traditional approach to implementation. It details the associated challenges and develops an argument for a different approach, now made possible through new hardware advancements.

 

Part II addresses the experienced embedded engineer and explains Twilio’s thinking with regard to how the above-mentioned challenges can be effectively addressed with a new architecture.

 

Part I – The Challenges of Connecting Devices

1

Key Considerations for building an IoT device

Connected devices vs. unconnected devices

Microcontrollers have been used in products for many decades, and have revolutionized product feature sets, reliability and performance over time. Moore’s law has brought 16- and 32-bit processing to even the smallest and cheapest consumer products, and the resulting availability of memory and CPU power has enabled the use of real-time operating systems (RTOS) where previously developers had to write “bare metal” code.

 

However, the transition from unconnected to connected products—in the context of IoT—has uncovered fundamental issues with how software is built for microcontrollers.
 

Connected device architecture

For IoT devices built around microcontrollers, a typical high level system architecture might look something like the diagram shown here. On the hardware side, there’s a microcontroller connected to both networking hardware (Cellular/Wi-Fi/Ethernet) and to the application hardware—the sensors and actuators used by the IoT application.

Typical high-level system architecture

In order to manage resources and tasks, an off-the-shelf RTOS is typically used. There are many choices here such as FreeRTOS, NuttX, ThreadX, and from a high level they all perform the same tasks—allocation of both memory and processor resources to different tasks within the system. To help decouple higher software layers from the specific hardware involved, there’s usually also a Hardware Abstraction Layer (HAL) which may be built into—or sit alongside—the RTOS, taking care of the actual hardware accesses to perform I/O.

 

Connected devices also need a network stack, typically providing TCP/IP networking. The bottom of the stack talks to the network hardware to exchange packets, and the top of the stack provides stream and datagram APIs. On top of this is layered the security stack, to provide authentication and encryption services used by both cloud communications and FOTA (Firmware Over-The-Air update) services.

 

At the very top, there’s the application, implementing the specific functionality of the device at hand. This talks to the application hardware and the system services and additionally takes care of cloud communication.

 

Usually, the stack has been integrated by the device maker:

 

  • Blue indicates parts that are completely unique to the application
  • Red indicates parts that are more often than not open-source or vendor-provided codebases (such as FreeRTOS, lwIP, Mbed TLS)
  • Purple indicates areas which may be based on open-source or vendor code but are often heavily customized for the application. For example, the cloud communication code may be an open-source client for MQTT (a widely used messaging protocol), but with modifications to use customer TLS certificates.

Integration & maintenance challenges

Some pre-integration often exists. For example, Arm provide packaged releases of Mbed OS with a HAL, network stack and security stack, and Infineon/Cypress provide FreeRTOS, lwIP and Mbed TLS as part of their WICED platform (Wireless Internet Connectivity for Embedded Devices, a platform to enable Wi-Fi and Bluetooth connectivity in system design). Yet the design decisions made by these integrators do not always line up well with the application requirements, resulting in heavy developer customization. That, in turn, comes with the additional complexity of having to merge new releases from the supplier with the existing code base.

 

Merging changes from suppliers is a requirement to maintain system stability and security over the long term (and in IoT deployments that can mean a decade or more), especially as these packages usually include code that is directly network-facing, which is easiest for an attacker to target. While some vendors provide long term support branches (“LTS”), which retain API compatibility for essential security updates, the definition of “Long Term” is often not compatible with a product’s lifecycle. For example, Mbed TLS has LTS releases which offer security updates without API changes for up to 3 years. But beyond that, the developer would need to integrate a possibly radically different API to maintain a secure product, or heavily compromise the product’s security by continuing to rely on out-of-date code.
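To put rough numbers on the mismatch between LTS windows and product lifetimes, the following is a minimal back-of-the-envelope sketch. The function name and the best-case assumptions (the product ships on a brand-new LTS branch, and each later branch is adopted exactly when the previous one stops receiving fixes) are illustrative and do not come from any vendor’s support policy.

```python
import math


def lts_migrations(product_years: float, lts_years: float) -> int:
    """Count the moves to a newer LTS branch needed to stay on supported code.

    Best case: the product ships on a brand-new LTS branch and each later
    branch is adopted on the day the previous one stops receiving fixes.
    """
    if product_years <= 0 or lts_years <= 0:
        raise ValueError("both durations must be positive")
    branches_needed = math.ceil(product_years / lts_years)
    return branches_needed - 1  # the first branch ships with the product


if __name__ == "__main__":
    # Hypothetical 10-year product using a library with 3-year LTS branches.
    print(lts_migrations(10, 3))  # prints 3
```

Under these assumptions, a product with a ten-year life has to absorb at least three potentially API-breaking migrations for a single library. In practice the count is likely to be higher, because products rarely ship on the first day of an LTS branch.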

 

As always, the more software you’re writing or integrating, the more maintenance you will have to perform on this code over the product’s entire lifecycle. Whereas an unconnected product might comprise 90% application code and 10% third-party code (and ongoing maintenance isn’t required as physical access would be required for any attack), connected products are often 20% application code and 80% third-party code, all of which has to be maintained to protect the user and manufacturer’s reputation.

 

Security design

Besides maintenance, there’s a very real problem with both design and implementation of security components. As with any specialist field, there’s a lot of expertise required to make the correct trade-offs and design decisions when building a connected product, and people with the appropriate skills are rare and hence expensive to hire.

When areas of the product are being architected from scratch, especially parts which may not be serviced adequately by well-supported open source software, the risks associated with a subtly flawed design decision could be significant.

 

Value and cost predictability

As can be seen in the architecture diagram, there’s a huge amount of software required to build a secure connected product—and most of it does not depend on the application itself. Not only is the time and money spent on integrating and maintaining external components a huge burden to a product’s lifetime costs, it is also essentially invisible to the end user, and doesn’t differentiate the product in the market.

Millions of engineer-hours have gone into reinventing the “connectivity wheel” for every single IoT product that has ever shipped. Complexity, budgets, schedules and lack of relevant domain knowledge have also meant that many of these products suffer from latent security issues just waiting to ruin someone’s day.

2

Solving maintenance issues in IoT

As noted, one of the major challenges with solving the maintenance issue in an MCU design is the close integration between the RTOS and the application. Larger systems such as desktop computers and mobile phones have always had an OS/application split, with the platform supplier, e.g. Microsoft, maintaining the operating system & network stack and providing updates over time to keep it secure.

 

So, could these problems be addressed with a similar OS/application split applied to embedded systems? There are three issues that crop up:

 

  • Who is responsible for maintaining and updating the operating system, and how can they ensure that updates do not have a detrimental effect on product operation?
  • How much extra cost and complexity does changing OS (or the way the OS is integrated) bring to development?
  • What’s the impact on the Bill of Material (BOM) cost to provide this split?

 

Responsibility for updates

When compared to a desktop or mobile application, embedded IoT applications are vastly different. Most desktop applications—and almost all mobile applications—are human-centric, providing a service or function to the user via processing and connectivity provided by the host device. As such, performance and consistency are appropriate for humans; the user interface might change, a screen might take a couple of seconds longer to appear, or functionality may be degraded if connectivity is not available—but humans quickly adapt.

 

In comparison, an embedded IoT application is I/O-centric and may have non-negotiable performance targets—whether these are for response time, functionality in the event of degraded communication, or power consumption. These targets depend on the specific use case of the device.

 

This different set of developer expectations, coupled with the reality that updates have to be deployed to unattended devices that may be physically entirely inaccessible for their lifetime, results in a very different burden on the shoulders of whoever maintains the devices.

 

Essentially, the developer needs to have confidence that no third-party update will ever break the deployed application. There are two ways the maintainer can help relieve developer concerns:

 

  1. Comprehensive testing. A cursory smoke check or ad-hoc manual quality assurance is not going to uncover the insidious issues that cause problems with embedded systems. All relevant guaranteed behavior and performance has to be tested continuously (so regressions can be addressed well before any release) and in an automated fashion (so that testing is always performed in a consistent manner).
  2. Minimized functionality. Almost as important as testing is minimization of the maintained footprint. The less functionality that is delegated to the third party, the less functionality that could change behavior in the event of an update. In the real world, device performance can vary based on factors out of anyone’s control—RF (Radio Frequency) propagation or network routing, for example. And when diagnosing such issues, being able to remove third-party updates from the list of possible culprits is very helpful.
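As a rough illustration of the first point, an automated regression gate can compare measurements taken on target hardware against the guarantees made to developers. The sketch below is a minimal example under stated assumptions: the `Guarantee` type, the metric names and the limits are all hypothetical, and how the measurements are collected is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guarantee:
    name: str     # metric identifier, e.g. "irq_latency"
    limit: float  # worst acceptable value for the metric
    unit: str


def find_regressions(guarantees, measurements):
    """Return one message per violated guarantee.

    A guarantee without a measurement is treated as a failure, so a test
    that silently stopped running cannot pass unnoticed.
    """
    failures = []
    for g in guarantees:
        value = measurements.get(g.name)
        if value is None:
            failures.append(f"{g.name}: no measurement recorded")
        elif value > g.limit:
            failures.append(
                f"{g.name}: {value} {g.unit} exceeds limit of {g.limit} {g.unit}"
            )
    return failures


if __name__ == "__main__":
    # Hypothetical guarantees and one hypothetical test run.
    guarantees = [
        Guarantee("irq_latency", 50.0, "us"),
        Guarantee("sleep_current", 10.0, "uA"),
    ]
    run = {"irq_latency": 42.0, "sleep_current": 12.5}
    for message in find_regressions(guarantees, run):
        print(message)
```

Running a check like this on every candidate build, rather than only before a release, is what allows regressions to be caught well before they reach deployed devices.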

 

Just as hardware developers are intimately aware of DFM (Design for Manufacture, a set of practices that help products move smoothly from prototype to production with high yield and minimal field failures), software developers are aware of DFT (Design for Testability).

 

In the world of long-lived IoT products, consistent testing over long periods of time is essential. This means that testing must be automated rather than manual, as the people working at an organization will change over time. As such, at a minimum, OS and networking code must be defended with a full suite of automated tests: from build-time unit testing to system testing on target hardware, to regular fuzz testing of external interfaces in order to uncover unintended behaviors. This level and duration of DFT and test automation is obviously expensive.
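To make the idea of fuzz testing an external interface concrete, here is a minimal host-side sketch. The frame format (one type byte, one length byte, then the payload), the `parse_frame` function and the `FrameError` type are all hypothetical and exist only for illustration; a real campaign would more likely use a coverage-guided fuzzer against the production parser.

```python
import random


class FrameError(ValueError):
    """Raised for any malformed frame; the only exception the parser may raise."""


def parse_frame(data: bytes):
    """Parse a hypothetical frame: 1 type byte, 1 length byte, then the payload."""
    if len(data) < 2:
        raise FrameError("truncated header")
    frame_type, length = data[0], data[1]
    if len(data) != 2 + length:
        raise FrameError("length field does not match frame size")
    return frame_type, data[2:]


def fuzz(parser, iterations=10_000, seed=0, max_len=64):
    """Feed random byte strings to the parser and return how many it accepted.

    A clean FrameError is the expected outcome for bad input. Any other
    exception propagates and fails the run, which is the point of the test.
    """
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    accepted = 0
    for _ in range(iterations):
        size = rng.randrange(max_len)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            _, payload = parser(data)
        except FrameError:
            continue
        # Invariant: an accepted payload is exactly as long as declared.
        assert len(payload) == data[1]
        accepted += 1
    return accepted


if __name__ == "__main__":
    print("accepted frames:", fuzz(parse_frame))
```

The fixed seed is a deliberate choice: when a run fails, the same input sequence can be replayed years later by someone who was not on the original team.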

 

Complexity of development and the impact on cost and BOM

Just as developers get comfortable with a particular Instruction Set Architecture (ISA), they also get expertise in an operating system architecture, a set of development tools, and development & debug workflows. Changing any of these components—even for tangible long-term gains—is painful in the short term and can introduce uncertainty in project schedules.

 

In an ideal world, a developer would be able to continue to use their preferred tools and RTOS while still having someone else provide the essential maintenance for long-term support.

 

One advantage of linking the operating system with the application is that it becomes easy to only pull in OS code that the application actually makes use of. This reduces the footprint of the OS and hence reduces the overall memory usage of the product.

 

Adding any functionality to an embedded system—even reliable FOTA—does increase the hardware BOM cost, mainly related to flash and RAM usage. Unfortunately, there’s no real way around this, but the upsides are significant and the incremental costs are generally small, especially when compared to the cost of application development.

 

 

Maintained microvisor vs. maintained operating system

It’s clear from the sections above that attempting to provide a single maintained RTOS for a wide variety of applications is only going to be successful for a subset of developers—those who are already familiar with the chosen OS, and those who do not rely on custom modifications to that OS.

 

If, however, we approach the problem from a different angle and instead look at what services we are trying to provide to the embedded developer, a new solution appears: a hypervisor that runs alongside the developer’s RTOS and application code and insulates it from common attack vectors. Let’s call it a microvisor.

 

The areas which are security-related and hence require long-term maintenance are:

 

  • Secure boot, as attackers may target the boot process, and countermeasures may need to be deployed. A breach here could expose keys, application code, and more.
  • Network stack, as this can be attacked via the local network (and in some cases, via the internet).
  • Security stack, as crypto algorithms will need to evolve over time as protocol bugs and algorithmic weaknesses are discovered.
  • FOTA & connectivity services, which are built on top of the aforementioned layers and hence will also evolve over time.
  • Network drivers, as issues can appear over time with wireless networks evolving (compatibility updates, fixes for security issues in wireless module firmware, etc.)

 

With appropriate hardware support, such a microvisor can be built—one which claims the necessary peripherals for network support at boot time, and establishes an application-independent connection to the cloud service that provides FOTA updates. But aside from that, it stays largely out of the way of the developer’s application and choice of RTOS.
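The ownership split described above can be modelled in a few lines. This is a conceptual sketch only, not Twilio’s implementation or API: the `Owner` and `PeripheralMap` names and the peripheral names are hypothetical, and on a real device the rule would be enforced by the hardware support mentioned above rather than by a software lookup.

```python
from enum import Enum


class Owner(Enum):
    MICROVISOR = "microvisor"
    APPLICATION = "application"


class PeripheralMap:
    """Records which side owns each peripheral once boot has completed."""

    def __init__(self, all_peripherals, network_peripherals):
        reserved = set(network_peripherals)
        unknown = reserved - set(all_peripherals)
        if unknown:
            raise ValueError(f"unknown peripherals: {sorted(unknown)}")
        # Claimed once at boot; there is deliberately no method to change it.
        self._owner = {
            name: Owner.MICROVISOR if name in reserved else Owner.APPLICATION
            for name in all_peripherals
        }

    def may_access(self, requester: Owner, peripheral: str) -> bool:
        """True only if the requester owns the peripheral; unknown names are denied."""
        return self._owner.get(peripheral) is requester


if __name__ == "__main__":
    # Hypothetical board: one UART is wired to the modem, the rest is free.
    board = PeripheralMap(
        all_peripherals=["UART1", "SPI2", "I2C1", "MODEM_UART"],
        network_peripherals=["MODEM_UART"],
    )
    print(board.may_access(Owner.APPLICATION, "SPI2"))
    print(board.may_access(Owner.APPLICATION, "MODEM_UART"))
```

In this model everything that is not needed for networking is left to the application, which reflects the aim of keeping the maintained footprint as small as possible.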

 

The microvisor can then protect itself from attack, whether it be via hardware tampering or a network interface. It can also protect the developer’s application from a large variety of attacks.

 

 

  1. The modem contained in the scooter scans for the nearest cell tower to connect. This procedure is typically initialized by the base stations and terminated at the MME node of the VPLMN. The MME is one of the central components of the VPLMN. The MME identifies whom the IMSI belongs to by extracting the MNC and MCC fields. If the device is roaming, then the MME will need to talk to the HSS of the HPLMN in order to authenticate and authorize the scooter to connect to the network.
  2. The exchange of authentication, authorization, and accounting messages between the MME and the HSS is conducted through the DRA.
  3. Once the HSS gives an `OK`, the MME determines which internet gateway to use for sending internet traffic. In LTE architecture, the PGW is used. To figure out which PGW to use, the MME resolves the access point name to the PGW’s IP address with a simple DNS query to the HPLMN DNS server.
  4. Once the selection is done, the MME instructs the SGW to establish a GPRS Tunnelling Protocol (GTP) tunnel with the selected PGW.
  5. Once the required tunnels are established, the device can connect to the internet.

Ready for Part II? Download full paper

In the second part of our whitepaper, we take a deep dive into what a microvisor architecture looks like and, in particular, how Twilio is approaching the design with Twilio Microvisor. We'll cover topics such as peripheral usage, memory, networking, interrupt handling, exception handling, FOTA upgrades, power management, and more.


Sign up now for access to Part II and the full PDF version of our whitepaper.