For many years, audio in IP cameras was treated as a side effect. The camera had to see, the disk had to record, and the microphone was supposed to “capture something.” If, when reviewing the archive, you could make out a few individual words, that was already considered a success. This is exactly why the settings of modern IP cameras still show a strange mix of codecs from the telephone era, half-forgotten standards from the early MPEG days, and relatively modern algorithms that look as if they ended up in the interface by accident.
In real systems, the choice of an audio codec was almost never deliberate. It was left “as shipped from the factory,” selected using the principle “as long as the recorder doesn’t complain,” or simply ignored altogether. As a result, audio existed only formally, without any real requirements for quality, stability, or further processing.
By 2026, the situation has changed fundamentally. Audio is no longer secondary. It is part of video analytics, event search, ASR systems, incident analysis, legal disputes, and simply human understanding of what was happening outside the frame. Detectors of screams, gunshots, conflicts, or a baby crying depend not on megapixels, but on the quality of the audio signal. And here an unpleasant but important truth emerges: the wrong audio codec can kill sound just as reliably as a bad microphone or a poorly chosen camera placement.
Audio stream architecture in an IP camera
To understand why the codec plays such a critical role, it is worth looking at the typical audio processing chain in an IP camera:
An analog or MEMS microphone.
An analog-to-digital converter (ADC).
Pre-processing such as AGC, filtering, and noise reduction.
Encoding of the audio signal using the selected codec.
Multiplexing with the video stream.
Transmission over RTSP, HTTP, or a proprietary protocol.
Decoding on the NVR, VMS, or client side.
The choice of audio codec and sampling frequency affects several layers of the system at once: network load, archive size, latency, compatibility with the receiving side, and, most importantly today, how suitable the audio is for analytics and automated processing.
PCM: the most honest and the most inconvenient option
Let us start with the most straightforward solution: PCM (LPCM). This is uncompressed audio, essentially a digital copy of what the microphone hears. PCM does not distort speech, does not add artifacts, and does not mask noise. In theory, it is ideal.
In practice, PCM is merciless to infrastructure. The bitrate grows linearly with sampling frequency and bit depth. Even 16 bits at 16 kHz produce 256 kbit/s per mono channel, four times the bitrate of G.711, which looks excessive for audio in a video surveillance context. On low-bitrate substreams the audio can approach the size of the video itself, the network becomes overloaded, remote access begins to stutter, and some NVRs and cloud services react to PCM as if you tried to plug a vinyl record into a USB port.
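The arithmetic is simple enough to check by hand. A minimal sketch in Python, assuming a single mono channel and continuous 24-hour recording; the function names are illustrative and do not come from any camera SDK:

```python
def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw PCM bitrate in bits per second: rate x depth x channels."""
    return sample_rate_hz * bit_depth * channels


def bytes_per_day(bitrate_bps: int) -> int:
    """Storage needed for 24 hours of continuous recording, in bytes."""
    return bitrate_bps // 8 * 86_400


if __name__ == "__main__":
    rate = pcm_bitrate_bps(16_000, 16)
    print(rate)                 # 256000 bits per second
    print(bytes_per_day(rate))  # 2764800000 bytes, roughly 2.8 GB per day
```

Multiplying the daily figure by the number of cameras and the retention period gives a quick estimate of what uncompressed audio would cost in archive space.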
From a licensing perspective, PCM is perfectly clean. It is not a codec but a data representation format, with no patents or royalties. But in real distributed video surveillance systems, PCM survives only in closed configurations where the developer controls the entire chain from camera to player. For all other scenarios, PCM remains a theoretical reference rather than a practical standard.
The telephone era: G.711 and G.726
The next layer is the old guard: G.711 and G.726. These codecs came into video surveillance from the telephony world, where intelligibility of speech over poor communication channels was the main priority.
G.711 operates at an 8 kHz sampling rate with 8 bits per sample (64 kbit/s) and produces the familiar “telephone” sound. Speech is intelligible, but everything outside the narrow band simply disappears. G.726 applies ADPCM compression at the same 8 kHz sampling rate and cuts the bitrate to 16–40 kbit/s. It saves bandwidth, but it does not sound better than G.711 and does not fundamentally change the situation.
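G.711 fits each sample into 8 bits by logarithmic companding: quiet samples get finer quantization steps than loud ones. The sketch below uses the continuous mu-law curve with mu = 255 to illustrate the principle. It is not a bit-exact G.711 encoder (the standard uses a segmented approximation and also defines an A-law variant), and all function names are illustrative.

```python
import math

MU = 255.0  # companding constant of the mu-law curve


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def roundtrip_companded(x: float, levels: int = 127) -> float:
    """Compress, quantize to a signed 8-bit-sized code, then expand."""
    code = round(mulaw_compress(x) * levels)
    return mulaw_expand(code / levels)


def roundtrip_linear(x: float, levels: int = 127) -> float:
    """Plain uniform quantization with the same number of levels."""
    return round(x * levels) / levels


if __name__ == "__main__":
    quiet = 0.01  # a quiet sample, 1% of full scale
    print("companded error:", abs(roundtrip_companded(quiet) - quiet))
    print("linear error:   ", abs(roundtrip_linear(quiet) - quiet))
```

For quiet samples the companded round trip loses far less than uniform 8-bit quantization, which is why G.711 speech stays intelligible. Companding does nothing about the 8 kHz sampling rate, which is where the quality ceiling comes from.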
The main advantage of these codecs is compatibility. They are supported almost everywhere, their patents expired long ago, and there are no surprises. The main drawback is the quality ceiling, which no setting can raise. A baby crying, a door slam, or background room noise turns into a monotonous mush. For basic security and simple monitoring this may be enough, but for analytics and ASR it is no longer sufficient.
An attempt to widen the band: G.722 and G.722.1
G.722 and the similarly named G.722.1 (a separate codec rather than an extension of G.722) were attempts to move away from telephone quality. Both offer a 16 kHz sampling rate, more natural speech, and an extended spectrum. On paper, everything looks great.
In practice, the typical video surveillance story begins. The codec exists, but support is fragmented. One camera records it correctly, another sends a non-standard RTP stream, a third recorder accepts it but does not play it back, and a cloud service simply drops the audio without errors. There are almost no licensing issues, but engineering reality makes these codecs a risky choice for systems where predictability matters.
AAC: the de facto standard few talk about openly
AAC is a rare case where everything came together. This codec was designed for multimedia, but it fits video surveillance remarkably well. It encodes speech efficiently, does not turn noise into artifacts, uses bitrate effectively, and scales well with sampling frequency.
AAC supports a wide range of sampling rates; camera firmware typically exposes 8, 16, 32, 44.1, and 48 kHz. In practice, 16 kHz already sounds noticeably better than G.711, while 32 kHz provides additional headroom for complex scenes. At the same time, the archive does not grow to absurd sizes, and compatibility with MP4, RTSP, and cloud platforms remains high.
Yes, AAC is a patented codec. Formally, it is licensed through patent pools. But in reality, the end user of IP cameras never encounters this. The camera manufacturer has already included licensing in the product cost. This is why AAC has become the de facto standard for most scenarios today: local archives, cloud storage, analytics, ASR, and remote viewing. It is not perfect, but it is stable. And in video surveillance, stability is valued above almost everything else.
Exotics and the future: Opus, Speex, AMR, MPEG-2 Layer II
In the settings of some cameras, you may encounter other codecs as well. MPEG-2 Layer II is a reliable but obsolete relic of the MPEG era. Speex has been superseded by Opus and is interesting mainly from a historical perspective. AMR and AMR-WB came from the mobile world and largely stayed there, failing to fit into camera architectures.
Opus technically outperforms most codecs on the list. It is free, flexible, works extremely well with speech and noise, and supports a wide range of sampling frequencies. Yet it is almost never found in IP cameras. The reason is banal: a conservative industry, old SoCs, established software stacks, and manufacturers’ reluctance to risk compatibility. Opus is the future that has not yet arrived in mass video surveillance.
Sampling frequency: the parameter that matters more than it seems
Sampling frequency sets the upper limit of the audio spectrum (half the sampling rate, per the Nyquist theorem) and therefore its suitability for analytics.
8 kHz is telephone quality, suitable only for basic speech intelligibility.
16 kHz is the minimum acceptable level for analytics and ASR.
32 kHz provides better detail and works better in noisy scenes.
44.1 and 48 kHz are excessive for most video surveillance tasks and create unnecessary load.
In practice, the optimal choice for IP cameras is 16 or 32 kHz, depending on microphone quality and system goals.
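The link between sampling rate and usable spectrum is the Nyquist limit: a signal sampled at a given rate can carry frequencies only up to half that rate. A small sketch for the rates listed above (the function name is illustrative):

```python
def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


if __name__ == "__main__":
    for rate in (8_000, 16_000, 32_000, 44_100, 48_000):
        print(f"{rate} Hz sampling -> up to {nyquist_limit_hz(rate):.0f} Hz of audio")
```

These are theoretical ceilings. The usable band is somewhat narrower in practice because of anti-aliasing filters, the codec itself, and the microphone's own frequency response.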
Licensing and legal nuances
Free codecs such as PCM, G.711, G.722, Opus, and Speex do not require royalties, but they do not always deliver the required quality or compatibility. Patented codecs like AAC and AMR are typically already licensed by the camera manufacturer. Problems usually arise not on the camera side, but during server-side transcoding or cloud processing, where licenses must be considered separately.
Practical recommendations
If you strip away theory and marketing, the conclusion is simple:
For most systems, choose AAC.
Set the sampling frequency to 16 or 32 kHz.
Use G.711 only for compatibility reasons.
Use PCM only in special cases.
Always verify real codec support in your VMS and NVR.
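One way to check what a camera actually sends, as opposed to what its settings page claims, is to read the SDP that it returns in response to an RTSP DESCRIBE request. The sketch below only parses an SDP text that has already been obtained; fetching it is left out. The sample SDP is a hypothetical fragment, and the function and constant names are illustrative.

```python
import re

# Static RTP payload types from RFC 3551 that need no rtpmap line.
STATIC_PAYLOADS = {0: ("PCMU", 8000), 8: ("PCMA", 8000), 9: ("G722", 8000)}

RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/\s]+)/(\d+)", re.MULTILINE)


def audio_codecs(sdp: str) -> list[tuple[str, int]]:
    """Return (encoding name, RTP clock rate) for each audio payload in an SDP."""
    dynamic = {int(pt): (name, int(rate)) for pt, name, rate in RTPMAP.findall(sdp)}
    found = []
    for line in sdp.splitlines():
        if not line.startswith("m=audio "):
            continue
        # m=audio <port> <proto> <payload type> [<payload type> ...]
        for pt in map(int, line.split()[3:]):
            codec = dynamic.get(pt) or STATIC_PAYLOADS.get(pt)
            found.append(codec if codec else (f"unknown payload {pt}", 0))
    return found


# Hypothetical SDP fragment, shaped like what a camera might return.
SAMPLE_SDP = """v=0
m=video 0 RTP/AVP 96
a=rtpmap:96 H264/90000
m=audio 0 RTP/AVP 97
a=rtpmap:97 MPEG4-GENERIC/16000/1
"""

if __name__ == "__main__":
    print(audio_codecs(SAMPLE_SDP))
```

In SDP, PCMU and PCMA are the two G.711 variants, and MPEG4-GENERIC is one of the names under which AAC is carried. Note that G.722 is announced with an RTP clock rate of 8000 even though the audio is sampled at 16 kHz. This is a historical quirk preserved in RFC 3551, not a camera bug.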
And one important practical point. To properly configure an audio codec, the camera must provide full web access to its settings. Cameras that are configured only through a mobile app or a cloud service often hide real audio parameters and do not allow control over the codec or sampling frequency. In such systems, sound quality is determined not by the engineer, but by the manufacturer’s marketing decisions.
Conclusion
Modern IP cameras support a wide range of audio codecs, reflecting not an evolution but a historical cross-section of the industry. By 2026, audio in video surveillance has stopped being background noise and has become data. That is why the choice of audio codec and sampling frequency is an architectural decision, not just a checkbox in the settings.
Today, AAC at 16 or 32 kHz remains the most balanced and predictable option. It is not the trendiest and not the most ideologically pure, but it works everywhere, all the time, without calls to technical support. And in video surveillance, that is still the main criterion of quality.