A look at low-cost tap aggregation

25 Dec 2017

A while ago, I had a project that required capturing traffic from a number of sources, thus an adventure into possible solutions was born.

Why do we want to capture traffic?

There are numerous reasons to capture traffic including:

Network troubleshooting
Traffic monitoring (including intrusion detection)
Legal requests/requirements

Ultimately it’s about getting visibility, either for security or operations.

Our goal is given any path, to be able to replicate the traffic, with minimal impact.

How can we capture traffic?

There are generally 3 different methods for capturing traffic, each with their own complexities.

From a target device

At a basic level, a device can capture traffic in 2 ways:

‘CPU bound’; traffic that has been brought up the network stack and is ‘destined’ for this device
Raw sockets; ‘raw’ network traffic that the device receives on a network interface.

CPU bound traffic can be efficient to capture if filtered appropriately. When dealing with high traffic volumes processing traffic can take critical resources away from your business applications.

Raw sockets are generally very limited, hub-based topologies are not widely used, limiting any traffic to broadcast for the servers subnet, or targeted (CPU bound) traffic (multicast/unicast).

Generally, to be usable by an analyser the traffic would need to be encapsulated and transmitted, using further resources on the device.

Span a switch

This is similar to inline, but I’ve separated it here due to implementation differences.

SPAN’s generally come in 3 forms:

SPAN - take traffic from port A, mirror to port B, on the same device
RSPAN - take traffic from port A, mirror to port B, on a remote device (over layer 2)
ERSPAN - take traffic from port A, mirror to port B, on a remote device (over layer 3)

There are a number of downsides:

Vendors have varying levels of support;
- RSPAN on a Juniper EX series switch doesn’t work over an AE
- Tricks can be used, for example, spanning into a GRE tunnel to accomplish ERSPAN, this becomes hardware dependent though
CPU generated (ICMP/ARP/LLDP/BPDU etc) packets generally do not get mirrored
Invalid packets will not be seen (those dropped due to checksum errors for example)
CPU usage can increase drastically due to extra processing requirements
As we’re spanning in an L2 domain arp table churn can happen if you’re not careful

However, if you need insight into a switched domain and don’t want to inline-tap every cable, it might work for you.

Inline

Inline-taps come in many different flavours, supporting different media types. At a basic level, there are 2 main differences.

Passive tap

Require no power, pass the original signal onto the output
Reduce the output power by a known ratio (important when dealing with fibre)
No monitoring data or other insights possible
Very failure resistant

Logical operation

There are different technologies for mirroring the payload when using copper, these are generally resistor based, for fibre they’re either thin film or fused biconical taper based.

Note: Thin film is generally preferred for 40G+ links, due to their lower loss rate caused by more even light distribution.

The concept for all of them is the same:

Given an input of 100%
Bleed off x% of the signal to the mirror port
Pass the remaining 100-x% through to the output port

A key consideration when using fibre is the ‘split ratio’ aka how much light to bleed off, both the monitor and the output interfaces need enough light for the optics on the other end.

Active tap

They require power and re-generate the output signals.
Monitoring data can be provided (light levels etc)
When using fibre there are no split ratio considerations

Logical operation

The internal complexity varies, but the principle is:

Given an input of X
Generate the payload of X into Y and Z
Transmit Y to the monitor interface
Transmit Z to the output interface

It is possible to buy active taps, with the capability to ‘fail open’ meaning in the event of a power failure traffic will continue to flow.

I still prefer to use passive taps, which should be as resilient as a fibre patch panel.

Why do we want to aggregate it?

In the most simplistic deployment, we can simply send from the source to the destination:

However, there are a number of reasons to have an aggregation step in the middle:

Number of ingress points
- Aggregation of smaller capture points into larger interfaces
- Reduction in rack space required for analysers, servers etc
- Strategic aggregation to reduce physical requirements (fibre, rack space)
Multiple destinations
- Apply filtering logic to save on analyser licensing
- Send traffic to security appliances and network monitoring devices
Apply software logic to capture rules
Support multiple media types
- Provide longer reach for copper-based taps
- SMF, MMF, XFP, QSFP, Copper support
Single view of traffic
- Certain DPI/IDS appliances require full flows, difficult in ECMP networks
- I don’t recommend this due to the associated scaling issues

Downsides to having an aggregation step:

Potential congestion issues (we’re taking raw traffic, limited by the sender)
Single point of failure
- This could be mitigated using layer 1 fibre switches or similar
- It’s for monitoring traffic, so uptime is likely not as critical
Cost

Historically TAP aggregation has been a very expensive game, around 2-4k a port!

Thankfully Arista changed that with their 7150 series switch, which supports a tap mode as well as a ‘DANZ’ software suite for loss/latency monitoring, packet filtering and mirroring.

The DANZ suite is now available on the 7150, 7280E and 7500E series switches from Arista, opening up options from 1/2U fixed form to 7/11U chassi based deployments.

The 7150 series gives some nice features:

Port density; up to 64 10G ports
LANZ+ features for micro-burst analysis
PTP support for packet time stamping
~350ns latency for all packets
Multi-port mirroring
Hitless (ISSU) upgrades
A ‘normal’ Linux environment and Python based toolset
You configure it like any other switch!

Let’s implement a solution!

We’ll keep it simple with 1 aggregation point, 6 inputs and 2 outputs. In reality, this could be multiple levels of aggregation.

Inputs:

2 x Passive taps
2 x SPANs
2 x Active taps

Logical Groups:

Group 1 - Internet
Group 2 - Corporate
Group 3 - Regional

Outputs:

Bro Network Security Monitor (Internet + Regional only)
Server (Corporate only)

Logical Diagram

Switch configuration

The switch config is where we wire everything together, there are 3 key concepts:

Tap - input
Tool - output
Tap Group - logical grouping of TAP ports

The configuration is quite simple and well documented.

First, let’s put the switch into tap mode

switch>en
switch#configure terminal
switch(config)#tap aggregation
switch(config-tap-agg)#mode exclusive

This will place all switch ports into error disabled and enable tap/tool ports.

Next, define our tap ports

switch(config)#interface ethernet 1
switch(config-if-Et1)#description Transit Provider 1
switch(config-if-Et1)#switchport mode tap
switch(config-if-Et1)#switchport tool group INTERNET

switch(config)#interface ethernet 2
switch(config-if-Et2)#description Transit Provider 2
switch(config-if-Et2)#switchport mode tap
switch(config-if-Et2)#switchport tool group INTERNET

switch(config)#interface ethernet 3
switch(config-if-Et3)#description Corporate Spine 1
switch(config-if-Et3)#switchport mode tap
switch(config-if-Et3)#switchport tool group CORPORATE

switch(config)#interface ethernet 4
switch(config-if-Et4)#description Corporate Spine 2
switch(config-if-Et4)#switchport mode tap
switch(config-if-Et4)#switchport tool group CORPORATE

switch(config)#interface ethernet 5
switch(config-if-Et5)#description Wan Provider 1
switch(config-if-Et5)#switchport mode tap
switch(config-if-Et5)#switchport tool group WAN

switch(config)#interface ethernet 6
switch(config-if-Et6)#description Wan Provider 2
switch(config-if-Et6)#switchport mode tap
switch(config-if-Et6)#switchport tool group WAN

We now have 3 groups with 3 interfaces in each.

Finally, map those groups onto our outputs.

switch(config)#interface ethernet 10
switch(config-if-Et10)#description Bro Network Security Monitor
switch(config-if-Et10)#switchport mode tool
switch(config-if-Et10)#switchport tool group set INTERNET WAN

switch(config)#interface ethernet 11
switch(config-if-Et11)#description Secret Server
switch(config-if-Et11)#switchport mode tool
switch(config-if-Et11)#switchport tool group set CORPORATE

Et10 + Et11 will now receive any traffic sent to their relevant groups.

Advanced features

In the demo, we have a static input -> output allocation in real life this can be software controlled or filter based.

We could also truncate packets to look at only their headers, potentially useful for encrypted traffic.

A simple traffic steering example is as below:

For traffic coming into Eth1
Match traffic targeting 8.8.8.8
Send to tap group GOOGLE_DNS
Send tap group GOOGLE_DNS to Eth20

switch(config)#ip access-list ACL_GOOGLE_DNS
switch(config-acl-ACL_GOOGLE_DNS)#permit ip any 8.8.8.8/32

switch(config)#class-map type tapagg match-any TAP_CLASS_MAP
switch(config-cmap-TAP_CLASS_MAP)#match ip access-group ACL_GOOGLE_DNS

switch(config)#policy-map type tapagg TAP_POLICY
switch(config-pmap-TAP_POLICY)#class TAP_CLASS_MAP
switch(config-pmap-TAP_POLICY-TAP_CLASS_MAP)#set aggregation-group GOOGLE_DNS

switch(config)#interface ethernet 20
switch(config-if-Et20)#description Magic Box
switch(config-if-Et20)#switchport mode tool
switch(config-if-Et20)#switchport tool group set GOOGLE_DNS

switch(config)#interface ethernet1
switch(config-if-Et1)#service-policy type tapagg input TAP_CLASS_MAP

For software-based control, Arista provides a powerful HTTP API as well as XMPP client support and ‘on-device’ APIs. The Python eAPI client can be found on GitHub, with some examples.

Summary

It is cost effective to deploy tap infrastructure where required. The offering from Arista is very powerful, allowing flexibility to meet any number of requirements.

With a number of integration options (JSON API, Python API, XMPP client), having dynamic tapping capabilities is a nice spanner to have in your toolbox.

And the cost? Basically, about the same as a 10G switch with routing functionality.

Naturally, the cost is varied depending on physical requirements (fibre/copper runs), port speeds (1/10/25/40/100G) and port count; this is applicable to any switch or other tap based deployment.

Footnote

I would advise any tap deployments to be carefully planned, certain topologies, such as multi-stage clos networks do not make good tap targets, both due to the number of links involved and the resulting bandwidth requirements for the TAP infrastructure.

Depending on your goals, tapping at natural points of congestion, such as the ingress/egress points for your high performance/bandwidth network segments (transit links, firewalls etc), will likely provide highly useful information at a vastly reduce capture complexity.

A look at low-cost tap aggregation

Why do we want to capture traffic?

How can we capture traffic?

From a target device

Span a switch

Inline

Passive tap

Logical operation

Active tap

Logical operation

Why do we want to aggregate it?

Let’s implement a solution!

Logical Diagram

Switch configuration

Advanced features

Summary

Footnote

Related posts