CCTV Focus

Cloud and P2P in Video Surveillance: Why Mobile Apps Deliver Lower Latency Than Browsers


The War for Seconds: Why Latency Became the Key Metric in Video Surveillance

Ten years ago, video surveillance was measured by different criteria: resolution, field of view, frame rate, archive depth. Latency hardly mattered, because users were either watching recordings or a local live stream within a single network. The camera, recorder, and monitor all lived in the same building, on the same switch, in the same physical reality. The internet was secondary if it was involved at all.
Once video surveillance moved to the cloud, the rules changed. Cameras went online, users moved to mobile devices, and networks became unpredictable. LTE, 5G, public Wi-Fi, corporate proxies, carrier CG-NAT - all of this turned live video delivery into a problem where every second of delay became critical. When a security guard or business owner opens a camera feed in a mobile app, they expect to see what’s happening now, not what happened ten seconds ago. At that point, video surveillance stops being “recording” and becomes an interactive interface to the real world.
This is where classic protocols started to crack. RTSP, designed for local networks, proved poorly suited for the internet and NAT. HLS, perfect for mass video streaming, turned out to be too slow for live scenarios. RTMP, long the low-latency standard, died along with Flash and failed to integrate into modern web architecture. The industry found itself searching for a new balance between speed, resilience, and control.
SRT did not appear as a revolution, but as an answer to a very concrete engineering need. It wasn’t designed specifically for video surveillance, yet surveillance became one of the areas where SRT’s properties matched real-world requirements almost perfectly. Low latency, UDP-based transport, loss resilience, built-in encryption, and no hard dependency on the browser—all of this made SRT a natural choice for mobile apps and operator interfaces. To understand why, we first need to understand what SRT actually is.

SRT Without Myths: UDP, Reliability, and Time Control

Secure Reliable Transport is often described as “UDP with brains,” and there is some truth in that. At its core, SRT is built on UDP - a protocol that guarantees neither delivery, nor order, nor integrity of packets, but provides minimal latency and maximum throughput. That’s why UDP has been used in real-time systems for decades, from VoIP to video conferencing. Pure UDP, however, is too fragile for the internet, where packet loss and jitter are the norm.
SRT solves this problem differently than TCP. It doesn’t try to deliver every byte at all costs. Instead, SRT introduces the concept of a time window. The protocol knows how much time it is allowed to spend trying to recover a lost packet. If the packet is not delivered and acknowledged within the configured latency, it is considered lost forever. The video moves on. As a result, the system does not “freeze” or accumulate delay, as TCP connections tend to do on poor networks.
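The time-window logic above can be sketched in a few lines. This is a simplified illustrative model, not the real libsrt implementation: each packet carries a send timestamp, and anything that cannot be delivered within the configured latency window is dropped so playback never stalls.

```python
def filter_packets(packets, latency_ms=120):
    """Simplified model of SRT's receive-side latency window.

    packets: list of (seq, send_ms, arrive_ms) tuples.
    Returns the sequence numbers the player actually receives:
    a packet that arrives later than its send time plus the latency
    window is dropped instead of stalling playback, which is exactly
    where a TCP stream would block and accumulate delay.
    """
    delivered = []
    for seq, send_ms, arrive_ms in packets:
        if arrive_ms - send_ms <= latency_ms:
            delivered.append(seq)  # on time: hand it to the decoder
        # else: deadline missed, the stream moves on without it
    return delivered
```

With a 120 ms window, a packet that takes 190 ms to arrive is simply skipped; raising the window trades delay for resilience, which is the tuning knob the next paragraph describes.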
This is the fundamental difference and the key to understanding why SRT works so well in video surveillance. Video is a continuous stream where continuity matters more than absolute accuracy. Losing a few packets may cause artifacts, but losing time destroys the very idea of live monitoring. SRT allows the developer to set this balance manually: increase latency to improve resilience, or reduce it to achieve minimal delay. This level of control is especially important in mobile networks, where channel quality can change literally every second.
SRT also includes encryption by design. It’s not an add-on or an optional TLS layer on top of something else—it’s part of the protocol itself. For video surveillance, where video is almost always personal data, this is critical. In a world where cameras watch streets, offices, entrances, and private homes, transmitting video in the clear is simply unacceptable. SRT addresses this without forcing engineers to build complex overlays on top of RTSP or invent proprietary solutions.
It’s also important to understand what SRT is not. It’s not a codec, not a container, and not a player. SRT doesn’t know what H.264 or H.265 is; it just moves bytes. In real video surveillance systems, SRT is most often used to transport an MPEG-TS stream carrying H.264 or H.265 video. This makes it compatible with the existing ecosystem of cameras and encoders, without requiring radical changes at the video source level.
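In practice this "SRT carries MPEG-TS" arrangement is often set up with ffmpeg. The sketch below builds such a command from Python; the camera address is a placeholder, and the SRT URL query syntax (`mode=listener`) is the form understood by libsrt-based tools, so consult your tool's documentation for the exact options it accepts.

```python
import shlex

# Hypothetical camera address -- replace with a real RTSP URL.
RTSP_IN = "rtsp://192.168.1.10:554/stream1"

# Publish over SRT in listener mode so clients can call in over UDP.
SRT_OUT = "srt://0.0.0.0:9000?mode=listener"

cmd = [
    "ffmpeg",
    "-i", RTSP_IN,
    "-c", "copy",     # no re-encoding: SRT just moves the H.264/H.265 bytes
    "-f", "mpegts",   # the conventional SRT payload is an MPEG-TS mux
    SRT_OUT,
]
print(shlex.join(cmd))
```

Because the video is only remuxed, not re-encoded, the camera side needs no changes at all, which is the compatibility point made above.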
Things get especially interesting when SRT meets the concept of P2P - a term that, in video surveillance, has long become more marketing than engineering.

The P2P That Doesn’t Exist: How Cloud Cameras Really Connect

If you believe marketing brochures, most cloud cameras work via P2P. The camera supposedly connects directly to the user’s phone, bypassing servers, clouds, and intermediaries. It sounds nice, but has little to do with reality. In the real internet, cameras are almost always behind NAT, often multiple layers of NAT, while mobile devices sit behind carrier CG-NAT. In such conditions, direct connections are possible only in limited scenarios and with many caveats.
The real architecture of cloud video surveillance almost always involves a server. Sometimes it’s used only for signaling and authentication; sometimes for full video relay. In most cases, the system attempts to establish a direct connection between the camera and the client, but switches to a server relay at the first sign of trouble. The user continues to see a “P2P” interface, even though the video is actually flowing through the cloud.
SRT fits perfectly into this model. It doesn’t require complex ICE logic like WebRTC when a server intermediary is used. The camera or edge server publishes a stream via SRT, and clients connect in play mode. In most cases, the client initiates the connection, which is critical for NAT traversal. The server operates in listener mode, accepting incoming UDP connections. This is a simple and reliable scheme that scales well when one camera has anywhere from one to ten viewers.
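The listener/caller split can be modeled with plain UDP sockets. This is a loopback sketch of the connection roles only, not a real SRT handshake: the server binds and waits (listener mode), the client sends the first packet (caller mode). Because the client initiates, its NAT creates an outbound mapping that the server's replies reuse, which is why caller-on-the-client is the NAT-friendly arrangement.

```python
import socket
import threading

def run_listener(sock):
    # Listener role: wait for the caller's first packet, then reply.
    data, addr = sock.recvfrom(2048)
    # Replies travel back through the NAT mapping the caller just opened.
    sock.sendto(b"stream:" + data, addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))   # listener: fixed, reachable address
port = server.getsockname()[1]

t = threading.Thread(target=run_listener, args=(server,))
t.start()

# Caller role: the client initiates the exchange (play mode).
caller = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
caller.sendto(b"play cam-01", ("127.0.0.1", port))
reply, _ = caller.recvfrom(2048)
t.join()
server.close()
caller.close()
print(reply)  # b'stream:play cam-01'
```

In a real deployment the listener is the relay server with a public address, and every camera viewer is a caller; that is the whole NAT-traversal story, with no ICE machinery required.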
It’s important to note that this approach is neither deception nor compromise. It’s a conscious engineering choice. Fully serverless P2P in video surveillance scales poorly, is hard to debug, and is unstable in mass-market deployments. Even a minimal server provides control, security, and centralized access management. In this architecture, SRT becomes the transport layer between server and client—not a magical way to bypass all network limitations.
At this point, it becomes clear why mobile applications beat browsers in terms of latency. The difference lies not only in the protocol, but in the entire delivery model.

Why Mobile Apps Feel “More Live” Than Browsers: Architecture Matters

When a user opens a camera in a browser, they are almost always interacting with an HTML5 video element. This element supports a limited set of protocols and formats, the most important of which is HLS. HLS is designed for reliable video delivery over HTTP. It scales extremely well, caches easily, and works beautifully with CDNs. But this universality comes at the cost of latency.
HLS splits video into segments that the client downloads over HTTP. The player keeps several segments buffered to smooth out network fluctuations. This means there is always a lag between real time and what the user sees. Even with aggressive tuning, it rarely drops below a few seconds. For movies or broadcasts, that’s fine. For video surveillance, it’s critical.
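The latency floor described above is simple arithmetic. The function below is an illustrative back-of-the-envelope model (the one-second encode/publish overhead and the segment counts are assumptions, not measurements): the player's delay is roughly the publishing overhead plus the buffered segments.

```python
def hls_min_latency(segment_s, buffered_segments, encode_publish_s=1.0):
    """Rough lower bound on HLS glass-to-glass delay, in seconds.

    segment_s:          duration of one HLS segment
    buffered_segments:  segments the player holds before starting playback
    encode_publish_s:   assumed time to encode and publish a segment
    """
    return encode_publish_s + segment_s * buffered_segments

# Classic HLS: 6 s segments, 3 buffered -> ~19 s behind live.
# Aggressively tuned: 2 s segments, 3 buffered -> still ~7 s.
```

Even the tuned configuration stays several seconds behind real time, which is why segment-based delivery is fine for broadcast but painful for live monitoring.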
Mobile applications are in a completely different position. They are not constrained by the browser stack and can use native video playback libraries. On Android, this is often libVLC or FFmpeg-based players that can work directly with UDP, SRT, and RTSP. These players give developers control over buffering, allow precise latency windows, and let them decide what to sacrifice—resilience or delay.
In addition, mobile apps have more direct access to the operating system’s network stack. They can adapt better to mobile network characteristics, react faster to changing channel quality, and use optimizations unavailable to browsers. Combined with SRT, this results in a noticeable latency advantage. In real video surveillance systems, end-to-end delay from camera to smartphone screen using SRT often falls in the one-to-two-second range—close to the practical limit without complex bidirectional protocols like WebRTC.
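An app with this kind of network awareness can retune SRT on the fly. A common rule of thumb from SRT deployment practice (a guideline, not part of the protocol) is to size the latency window at roughly four times the measured round-trip time, never below the protocol's 120 ms default:

```python
def pick_srt_latency(rtt_ms, multiplier=4, floor_ms=120):
    """Pick an SRT latency window from a measured RTT.

    Rule of thumb: ~4x RTT gives retransmissions enough round trips
    to succeed; the floor matches SRT's default latency of 120 ms.
    """
    return max(floor_ms, int(rtt_ms * multiplier))

# Good Wi-Fi (RTT ~20 ms): stay at the 120 ms floor.
# Congested LTE (RTT ~80 ms): widen the window to ~320 ms.
```

A browser player has no way to apply this kind of adjustment; a native app can remeasure RTT and reconnect with a new window whenever the channel changes.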
This doesn’t mean browsers are “bad” or “slow.” They simply solve a different problem. The browser stack is optimized for security, compatibility, and scale. It is not designed for low-level network protocol control. Mobile apps, by contrast, can afford to be more specialized and aggressive in their tuning. That’s why the video surveillance industry increasingly uses different protocols for different clients.

VSaaS Architecture: Two Protocols, One User Experience

Modern VSaaS platforms rarely bet on a single video delivery protocol. Instead, they build layered architectures where each client receives video in the format best suited to its capabilities and constraints. A typical architecture includes the camera, a cloud backend, a media layer, and client applications.
Cameras usually continue to deliver video via RTSP. It’s a proven, widely supported protocol that works well within local networks and between cameras and servers. The stream then reaches an edge or cloud server that performs multiple functions: authentication, access control, connection accounting, and, when needed, video relay. This is where the protocol for each client is selected.
For mobile apps, the server typically offers SRT or WebRTC. SRT is chosen where simplicity, predictability, and latency control matter most. The client connects to the server via SRT, receives the stream with minimal buffering, and sees a near-real-time picture. For browsers, the server provides HLS, sometimes in a low-latency configuration. This ensures compatibility with any device and allows scaling to thousands of users via CDN.
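The per-client selection logic described above reduces to a small routing decision. The sketch below is illustrative; the client names and the talkback flag are hypothetical, not a real API:

```python
def choose_protocol(client, wants_talkback=False):
    """Sketch of per-client protocol selection in a VSaaS media layer.

    client: 'browser', 'mobile', or 'camera' (illustrative labels).
    wants_talkback: whether bidirectional audio/video is required.
    """
    if client == "browser":
        return "hls"       # universal playback, CDN-friendly, higher latency
    if client == "mobile":
        # WebRTC only where two-way communication justifies its complexity;
        # otherwise SRT for simple, predictable low-latency delivery.
        return "webrtc" if wants_talkback else "srt"
    if client == "camera":
        return "rtsp"      # the ingest side stays RTSP
    raise ValueError(f"unknown client type: {client}")
```

The point of the design is that the choice lives entirely in the media layer: clients never negotiate it, they simply receive the transport their platform handles best.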
Crucially, for the user this all looks like a single service. They open a camera in a mobile app or in a browser and see video. The differences in protocols, delays, and buffers are hidden inside the architecture. This approach is now considered mature and industrial-grade. It acknowledges the limitations of each platform and uses their strengths instead of trying to force a universal solution.
Within this architecture, SRT occupies a clearly defined niche. It doesn’t replace HLS—it complements it. It doesn’t try to be universal—it solves a specific problem: delivering live video with minimal latency to controlled clients. That’s why SRT has taken root so well in mobile video surveillance applications.

Live Video: A Natural Step, Not a Trend

SRT is often perceived as a temporary trend or a niche solution. But viewed in the context of video surveillance evolution, it’s clearly a natural step. The industry has moved from local systems to cloud platforms, from monitors to mobile apps, from archives to real-time live interfaces. At every stage, requirements for video delivery changed—and SRT turned out to be the tool that best matches today’s user expectations.
This doesn’t mean SRT will replace all other protocols. HLS will remain the backbone for browsers and mass access. WebRTC will be used where bidirectional communication and ultra-low latency are required at any cost. RTSP will continue to live inside cameras and local networks. But SRT has secured a stable position between these worlds, offering an optimal balance for mobile and operator scenarios.
The key takeaway is that there is no longer a single “correct” protocol in video surveillance. There is architecture, where each protocol is used where it makes the most sense. SRT is not a magic wand and not “true P2P.” It’s a reliable transport that, when embedded in a well-designed VSaaS architecture, brings live video as close to real time as is realistically possible on public networks.
That’s why mobile apps will always feel more “live” than browsers, why P2P in cameras almost always implies the presence of a cloud, and why SRT today is seen not as an experiment, but as a working tool of the modern video surveillance industry.