The IPFIX (IP Flow Information Export) protocol provides an extensible
standard for transmitting network flow data.
A key difference compared to the likes of sFlow is the template-based nature of the data.
While very similar to NetFlow version 9, IPFIX enables variable-length fields
and vendor extensions. This makes the protocol suitable for carrying different types of
performance data, as desired by a given vendor.
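To make the template-driven format concrete, below is a minimal sketch in Go (the language used later in this post) of the fixed message and set headers from RFC 7011; the field names are mine, and everything after these headers is interpreted according to previously exchanged templates.

package ipfix

// MessageHeader is the fixed 16-byte header at the start of every
// IPFIX message (RFC 7011). Version is always 10 for IPFIX.
type MessageHeader struct {
	Version             uint16 // 0x000a
	Length              uint16 // total message length in bytes
	ExportTime          uint32 // seconds since the Unix epoch
	SequenceNumber      uint32 // running count of exported data records
	ObservationDomainID uint32 // exporter-scoped domain identifier
}

// SetHeader prefixes each set within a message; set ID 2 carries
// templates, 3 carries option templates, and IDs >= 256 carry data
// records decoded using a previously received template.
type SetHeader struct {
	SetID  uint16
	Length uint16
}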
A recent project required some processing of IPFIX flow data,
which this post will focus on.
TL;DR The full implementation can be found on
GitHub
As we previously discussed,
using cheap switches to aggregate multiple tap sources gives you a lot of power.
However, given the multiple feeds, how can you accurately measure timing information 1 hop away?
Using hardware time stamping provides a highly accurate record of when packets
were processed by devices, making it perfect for TAP aggregation.
Revisiting the 7150 platform
On the 7150 series hardware, time stamping is supported at line-rate using PTP.
You have two options for timestamp placement:
Replacement of the FCS (mac timestamp replace-fcs):
Appending of the timestamp (mac timestamp before-fcs):
The implementation of this is a little ‘quirky’.
Looking at the timestamp value alone will not help you, as it’s an internal
ASIC counter on the switch, essentially providing the lower half of the timestamp.
To calculate the actual (Unix-based) timestamp, a keyframe packet also has
to be processed and tracked (sent every ~6 seconds), providing the upper half of the timestamp.
While possible to implement, the imposed state tracking and skew calculations are a little unappealing.
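For illustration, a rough sketch of the bookkeeping involved, assuming you track the most recent keyframe; the names and the tick-to-nanosecond conversion are mine, and the real ASIC tick rate must come from the platform documentation:

package timestamp

// keyframe holds state taken from the periodic (~6 second) keyframe packets.
type keyframe struct {
	utcNanos  uint64 // UTC time carried in the keyframe
	asicTicks uint64 // ASIC counter value at that UTC time
}

// approxUTC reconstructs a packet's UTC timestamp from its ASIC counter
// value and the last seen keyframe. nanosPerTick is platform-specific and
// illustrative; real code must also handle the counter wrapping between
// keyframes and skew between the ASIC clock and UTC.
func approxUTC(kf keyframe, pktTicks uint64, nanosPerTick float64) uint64 {
	delta := pktTicks - kf.asicTicks // assumes no wrap since the keyframe
	return kf.utcNanos + uint64(float64(delta)*nanosPerTick)
}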
A look into the 7500{E,R}/7280{E,R} series
On the newer platforms, Arista has moved away from the keyframe setup and
introduced a custom EtherType, again using hardware timestamping at line rate and supporting PTP.
There are 3 possible timestamping modes on the 7500{E,R} & 7280{E,R} series switches:
64-bit header timestamp; i.e., encapsulated in the L2 header
48-bit header timestamp; i.e., encapsulated in the L2 header
48-bit timestamp that replaces the Source MAC
We will focus on the first 2 options, which use a custom EtherType inside the layer 2 header.
Note: All timestamps are captured upon packet ingress and stamped on packet egress.
A look into the packet format
Let’s compare a normal ethernet header:
To one with the custom EtherType inserted:
Note: .1q payloads are also supported, with the EtherType coming after the Source Address
As you can see, an extra 4 fields have been inserted into the header:
EtherType - 0xD28B - An identifier for AristaEtherType
Protocol sub-type - 0x1 - A sub-identifier for the AristaEtherType
Version - 0x10 or 0x20 - An identifier for either 64bit or 48bit
Timestamp - An IEEE 1588 time of day format
The timestamp is either 32 bits (seconds) followed by 32 bits (nanoseconds) or
16 bits (seconds) followed by 32 bits (nanoseconds) depending on the 64 or 48bit mode.
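Converting either layout into a Go time.Time is simple once decoded; a minimal sketch, assuming the raw fields are already extracted (the 16-bit seconds field in 48-bit mode wraps every ~18 hours, so we anchor it against a reference clock such as the capture host’s time):

package timestamp

import "time"

// ts64ToTime converts the 64-bit header timestamp: 32-bit seconds plus
// 32-bit nanoseconds, IEEE 1588 time of day.
func ts64ToTime(secs, nanos uint32) time.Time {
	return time.Unix(int64(secs), int64(nanos))
}

// ts48ToTime converts the 48-bit variant (16-bit seconds, 32-bit
// nanoseconds) by picking the wrap window closest to the reference clock.
func ts48ToTime(secs uint16, nanos uint32, ref time.Time) time.Time {
	const wrap = int64(1) << 16 // 16-bit seconds wrap (~18.2 hours)
	s := ref.Unix() - (ref.Unix() % wrap) + int64(secs)
	for _, c := range []int64{s - wrap, s + wrap} {
		if abs(c-ref.Unix()) < abs(s-ref.Unix()) {
			s = c
		}
	}
	return time.Unix(s, int64(nanos))
}

func abs(x int64) int64 {
	if x < 0 {
		return -x
	}
	return x
}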
Configuration
Enabling hardware timestamping on the platform is rather simple:
mac timestamp header enables timestamping on tool ports
mac timestamp header format <64bit | 48bit> sets the format of the timestamp
mac timestamp replace source-mac enables replacing the source mac address with the timestamp
There are some limitations to the timestamping support, notably:
Timestamping is done after packet processing, resulting in ~10ns of delay
64-bit timestamps may roll over inconsistently every ~4 seconds, causing jumps between packets
Decoding the packets
Now we’ve changed the Ethernet header, a specific decoder is required to process the packets.
Without one, the header is no longer valid Ethernet, as the Length field
contains a meaningless value.
Arista provides an
Lua extension for Wireshark for this purpose.
Decoding custom EtherTypes in gopacket
gopacket has a very useful pcap interface,
making it very easy to process data collected from TAP infrastructure.
Investigating the structure, it made sense to implement a custom
layer to handle
our EtherType.
After some experimentation, while this provided decoding of the timestamp data,
it prevented further processing of the packets, leaving the IP layer
inaccessible; this was complicated by our now-invalid Ethernet header.
A simple solution of extending the built-in
EthernetType
was called for.
// Copyright 2012 Google, Inc. All rights reserved.
// Copyright 2009-2011 Andreas Krennmair. All rights reserved.
//
// Use of this source code is governed by a BSD-style license
// that can be found in the LICENSE file in the root of the source
// tree.

package decoder

import (
	"encoding/binary"
	"errors"
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
)

// This layer has a two-byte protocol subtype of 0x1,
// a two-byte protocol version of 0x10 and
// an eight-byte UTC timestamp in IEEE 1588 time of day format.
// So that is 12 bytes in total that we need to strip off right after the src mac.
type AristaEtherType struct {
	ProtocolSubType      uint16
	ProtocolVersion      uint16
	TimestampSeconds     uint32
	TimestampNanoSeconds uint32
}

// AristaExtendedEthernet is the layer of a normal or Arista extended Ethernet frame header.
// This is the same as layers.Ethernet, but may have AristaEtherType filled with data
type AristaExtendedEthernet struct {
	layers.Ethernet
	AristaEtherType AristaEtherType
}

func (eth *AristaExtendedEthernet) DecodeFromBytes(data []byte, df gopacket.DecodeFeedback) error {
	if len(data) < 14 {
		return errors.New("AristaExtendedEthernet packet too small")
	}
	eth.DstMAC = net.HardwareAddr(data[0:6])
	eth.SrcMAC = net.HardwareAddr(data[6:12])

	// https://eos.arista.com/eos-4-18-1f/tap-aggregation-ingress-header-time-stamping/
	// Arista places 12 bytes directly after the src mac, see AristaEtherType comments for structure.
	// We handle both timestamped and non-timestamped frames here; we need at least
	// the 12-byte Arista header plus the real EtherType (bytes 26:28).
	etherType := binary.BigEndian.Uint16(data[12:14])
	if len(data) >= 28 && etherType == 53899 {
		eth.AristaEtherType = AristaEtherType{
			ProtocolSubType:      binary.BigEndian.Uint16(data[14:16]),
			ProtocolVersion:      binary.BigEndian.Uint16(data[16:18]),
			TimestampSeconds:     binary.BigEndian.Uint32(data[18:22]),
			TimestampNanoSeconds: binary.BigEndian.Uint32(data[22:26]),
		}
		eth.EthernetType = layers.EthernetType(binary.BigEndian.Uint16(data[26:28]))
		eth.BaseLayer = layers.BaseLayer{Contents: data[:28], Payload: data[28:]}
	} else {
		eth.EthernetType = layers.EthernetType(binary.BigEndian.Uint16(data[12:14]))
		eth.BaseLayer = layers.BaseLayer{Contents: data[:14], Payload: data[14:]}
	}

	// Logic from the upstream Ethernet code
	if eth.EthernetType < 0x0600 {
		eth.Length = uint16(eth.EthernetType)
		eth.EthernetType = layers.EthernetTypeLLC
		if cmp := len(eth.Payload) - int(eth.Length); cmp < 0 {
			df.SetTruncated()
		} else if cmp > 0 {
			eth.Payload = eth.Payload[:len(eth.Payload)-cmp]
		}
	}
	return nil
}

// Required methods to be a valid layer
func (e *AristaExtendedEthernet) LinkFlow() gopacket.Flow {
	return gopacket.NewFlow(layers.EndpointMAC, e.SrcMAC, e.DstMAC)
}

func (e *AristaExtendedEthernet) LayerType() gopacket.LayerType {
	return gopacket.LayerType(17)
}

func (eth *AristaExtendedEthernet) NextLayerType() gopacket.LayerType {
	return eth.EthernetType.LayerType()
}

// Public function
func DecodeAristaExtendedEthernet(data []byte, p gopacket.PacketBuilder) error {
	eth := &AristaExtendedEthernet{}
	err := eth.DecodeFromBytes(data, p)
	if err != nil {
		return err
	}
	p.AddLayer(eth)
	p.SetLinkLayer(eth)
	return p.NextDecoder(eth.EthernetType)
}
Now we have the custom decoder, we just need to register it with gopacket.
This makes gopacket use our decoder implementation rather than the built-in Ethernet one.
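A minimal usage sketch, assuming the decoder lives in a package imported as decoder (the import path and pcap file name are illustrative):

package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"

	// Hypothetical import path for the decoder package above.
	"example.com/tapagg/decoder"
)

func main() {
	handle, err := pcap.OpenOffline("capture.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Hand every frame to our decoder instead of the default Ethernet
	// decoder that handle.LinkType() would normally select.
	first := gopacket.DecodeFunc(decoder.DecodeAristaExtendedEthernet)
	for packet := range gopacket.NewPacketSource(handle, first).Packets() {
		if eth, ok := packet.LinkLayer().(*decoder.AristaExtendedEthernet); ok {
			fmt.Printf("ts=%d.%09d next=%s\n",
				eth.AristaEtherType.TimestampSeconds,
				eth.AristaEtherType.TimestampNanoSeconds,
				eth.EthernetType)
		}
	}
}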
It is commonly accepted to use TLS when accessing services over the internet, whether
they are based on HTTP, SMTP, IMAP, POP, FTP or any number of other protocols.
It is also commonly accepted to terminate those TLS connections on the edge,
handling all internal communications in plain text. This is for a number of reasons around scalability, performance and trust.
As technology stacks have matured, a number of security standards have been created,
including those for card handling (PCI DSS); many of these still contain phrasing such
as ‘encrypt transmission of cardholder data across open, public networks’, with the
definitions being open to interpretation.
Pause for a moment and consider if these scenarios are ‘across open, public networks’:
A point to point circuit provided over an external ‘dark fibre’ (DWDM or similar) network
A point to point wireless link between 2 buildings provided by a 3rd party
Cross connections between 2 suites within a datacenter, via the meet me room
I imagine most people would argue these are private:
All services are dedicated to you
Traffic is isolated from other customers
You control both ends of the connection
However, there are also risks, as they all pass through physical assets you don’t control:
Traffic interception; it has been widely published that data from the likes of Google has been intercepted and used for profiling activities
Network access; As with a physical intrusion to your rack/office, it may be possible to gain network access via a cross-connect, or inter-site link, depending on topology
Redundancy; A targeted attack on physical infrastructure could place your business operations at risk
Thankfully, most providers and many ISO standards have well-defined physical access controls,
which limit the possibilities of the above; however, that isn’t very effective against a nation-state,
or a cyber-based attack on a provider.
Many businesses have a wealth of information useful to a nation-state, from habits and preferences
to medical or travel data. It might be paranoia until they’re out to get you.
Ultimately this comes down to risk management, and whether you want to be ‘compliant’ or ‘secure’
with regard to your customers’ data.
Assuming we want to encrypt data end-to-end, let’s look at the technologies available.
TLS
As briefly noted above, the standard for encryption in the public network space is
Transport Layer Security
(TLS).
There are multiple versions of TLS, with 1.2 currently being the standard (1.3 is in draft).
The version and associated cryptographic ciphers you can use are usually dictated by
support requirements; many older browsers and SSL libraries don’t support the most
secure choices.
A good starting point is the excellent cipherli.st site,
highlighting secure configs for most platforms.
The results should then be checked, either using openssl s_client,
ssllabs.com or similar.
Generally, user-facing TLS is simple to deploy, with the potential for
small compatibility issues (including breaking certain browsers).
The direction for Google Chrome and others is to start displaying HTTP sites in
the same manner as invalid SSL is currently shown (red bars or similar), so it’s
highly advisable even if you don’t transmit any ‘sensitive info’ (sensitive here
includes tracking data, such as cookies).
What about the cost? Platforms such as Let’s Encrypt provide free SSL certs,
trusted by all major browsers. For dedicated extended validation certificates,
a few hundred dollars is a small cost for most online businesses.
TLS internally
Historically, concerns about the performance of TLS have stunted internal deployment;
modern versions of the libraries, combined with the current generation of CPUs, mean
TLS is not slow (mostly)!
The excellent istlsfastyet.com goes into detail about
the current state of TLS performance. The key takeaway is, when configured
correctly, TLS at the scale of Facebook and Google performs fast enough with minimal
CPU overhead.
Using the most secure cipher suites (ECDHE) is a little more costly, but with mitigations in place
(HTTP keepalives, session resumption etc), the performance overhead is negligible.
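As a concrete example, here is how a Go HTTPS server can pin the protocol floor and restrict itself to ECDHE suites; the suite list is illustrative rather than a recommendation, and the certificate paths are assumptions:

package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		// ECDHE-only cipher suites; an illustrative, not exhaustive, list.
		CipherSuites: []uint16{
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
		},
		PreferServerCipherSuites: true,
	}
	srv := &http.Server{Addr: ":8443", TLSConfig: cfg}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}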
Depending on your environment, you may purchase and use CA-signed certificates,
as is the case with external traffic; however, at a certain scale an internal
certificate authority makes sense.
There is a certain level of complexity in deploying and maintaining a secure
internal certificate authority and many tools exist to help with this.
Another advantage of an internal CA is being able to use certificate-based
authentication for clients, enabling devices to prove their identity.
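A minimal sketch of that client-authentication setup in Go, assuming the internal CA certificate and server key pair exist at the paths shown:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust only client certificates issued by the internal CA (path assumed).
	caPEM, err := os.ReadFile("internal-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cfg := &tls.Config{
		ClientCAs:  pool,
		ClientAuth: tls.RequireAndVerifyClientCert, // devices must prove their identity
	}
	srv := &http.Server{Addr: ":8443", TLSConfig: cfg}
	log.Fatal(srv.ListenAndServeTLS("server.pem", "server-key.pem"))
}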
Alternatives
Depending on your environment, encrypting all inter-device traffic with TLS may
be possible.
Given a small load balanced LAMP stack this should be relatively easy:
TLS from the user to the load balancer
TLS from the load balancer to the web server
TLS from PHP to MySQL
However, the number of services can easily spiral; given the example above,
we could easily have:
SMTP relays
Memcache/Redis caching
NFS based storage
Monitoring agents
You may also have services which don’t support TLS:
RADIUS
Windows Distributed File Shares
Access control / CCTV systems designed for closed networks
There are 2 areas to consider here:
Traffic passing over your network (Physical access restrictions in place)
Traffic passing over external infrastructure
For the first case:
I strongly suggest deploying TLS where possible; at worst, it is another layer of
defence should a malicious device get into the network.
For protocols lacking encryption support, their risk is likely low
If their risk is not low, I’d suggest re-evaluating the technology choice
Inline encryption is possible but complicated to scale at this level
For the second, we have a number of options described below.
Layer 1 encryption
There are a number of ‘black box’ solutions, which sit in-line to the network.
The general principle is un-encrypted data comes in one end, encrypted data comes
out the other; the reverse then happens to give you un-encrypted data on the other end.
These are generally expensive appliances, licensed by port or bandwidth capability.
They are also generally completely closed boxes, operating strictly at layer 1.
Deployed across DWDM or similar networks, these devices should ‘just work’ and provide
full encryption (layer 1 to 7). They are limited to ‘point to point’ links.
A side effect of the point to point encryption is the prevention of unauthorised traffic.
IEEE 802.1AE (MacSec)
A more open and industry standard approach would be to deploy MacSec.
MacSec provides encryption of the layer 2 header and up, but leaves the src/dest
mac addresses exposed.
It operates similarly to a normal Ethernet frame within a layer 2 network, with the
packets containing 4 fields:
Any layer 2 data outside of the mac addresses (VLAN tag, LLDP etc) is contained
within the encrypted data.
The security tag and ICV are used internally for MacSec, with the mac addresses being
used for forwarding.
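As a rough illustration of that layout (sizes simplified; in practice the SecTAG is 8-16 bytes and carries the MacSec EtherType 0x88E5, packet number and related fields):

package macsec

// Frame is a simplified view of an 802.1AE frame on the wire. Only the
// MAC addresses are visible to intermediate devices for forwarding;
// everything from the original EtherType onwards (VLAN tags, LLDP, the
// payload itself) sits inside EncryptedData.
type Frame struct {
	DstMAC        [6]byte
	SrcMAC        [6]byte
	SecTag        []byte   // begins with EtherType 0x88E5
	EncryptedData []byte   // the original L2 payload, encrypted
	ICV           [16]byte // integrity check value
}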
There is a hardware dependency associated with MacSec, as the encryption is done
in hardware to achieve line rate speeds. This varies between vendors but can
be in the form of dedicated line cards or whole products.
It is possible to offload the encryption to MacSec capable switches,
allowing routers and line cards to remain, with the switch sitting inline.
It may not be desirable to put a layer 2 device in your path, though it is
likely that the path will already be using BFD or similar to account for
any provider interruptions, which don’t result in an interface flap.
There are also some implementation considerations:
Additional header size needs to be accounted for in downstream MTUs
Certain providers filter layer 2 traffic; they may filter the MacSec control messages!
As with Layer 1 encryption, this prevents unauthorised traffic entering the network, as
well as protecting against interception.
DMVPN / Mesh VPN
It may be desirable in some cases to form a software-based VPN mesh over
your existing network, providing encryption between 2 or more points.
This could be in the form of a single IPsec tunnel, or a complex hub-spoke DMVPN
network. These could be deployed on dedicated devices or end-user devices.
For high traffic applications, these approaches are likely not applicable, due to
line rate speeds being desirable, but in branch or remote worker applications
they can be powerful options in your toolbox.
Summary
There is no one-size-fits-all solution. It is very dependent upon your
environment and, more specifically, the applications within it.
My advice is to look at the risk in each area, design an appropriate solution and
test it.
A targeted approach keeps things simple to start, but personally, I look to
provide end-to-end encryption everywhere… trusting no one.
If you have no clear areas of risk, start with everywhere you touch the outside
world, either via a public interface or provider managed services.
Ultimately it’s about getting visibility, either for security or operations.
Our goal is, given any path, to be able to replicate the traffic with minimal impact.
How can we capture traffic?
There are generally 3 different methods for capturing traffic, each with their
own complexities.
From a target device
At a basic level, a device can capture traffic in 2 ways:
‘CPU bound’; traffic that has been brought up the network stack and is ‘destined’ for this device
Raw sockets; ‘raw’ network traffic that the device receives on a network interface.
CPU bound traffic can be efficient to capture if filtered appropriately; when dealing with high
traffic volumes, processing traffic can take critical resources away from your business applications.
Raw sockets are generally very limited; hub-based topologies are not widely used, limiting visible traffic to broadcasts for the server’s subnet, or targeted (CPU bound) traffic (multicast/unicast).
Generally, to be usable by an analyser the traffic would need to be encapsulated and transmitted,
using further resources on the device.
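For the ‘filtered appropriately’ case, pushing a BPF filter into the kernel keeps the capture cost down; a sketch using gopacket (the interface name and filter are illustrative):

package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenLive("eth0", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Filter in the kernel so only interesting traffic reaches userspace.
	if err := handle.SetBPFFilter("tcp port 443"); err != nil {
		log.Fatal(err)
	}
	for packet := range gopacket.NewPacketSource(handle, handle.LinkType()).Packets() {
		fmt.Println(packet)
	}
}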
Span a switch
This is similar to inline, but I’ve separated it here due to implementation differences.
SPANs generally come in 3 forms:
SPAN - take traffic from port A, mirror to port B, on the same device
RSPAN - take traffic from port A, mirror to port B, on a remote device (over layer 2)
ERSPAN - take traffic from port A, mirror to port B, on a remote device (over layer 3)
There are a number of downsides:
Vendors have varying levels of support;
RSPAN on a Juniper EX series switch doesn’t work over an AE
Tricks can be used, for example spanning into a GRE tunnel to accomplish ERSPAN; this becomes hardware-dependent though
CPU generated (ICMP/ARP/LLDP/BPDU etc) packets generally do not get mirrored
Invalid packets will not be seen (those dropped due to checksum errors for example)
CPU usage can increase drastically due to extra processing requirements
As we’re spanning in an L2 domain, ARP table churn can happen if you’re not careful
However, if you need insight into a switched domain and don’t want to inline-tap every cable,
it might work for you.
Inline
Inline taps come in many different flavours, supporting different media types. At a basic level, there are 2 main variants.
Passive tap
Require no power, pass the original signal onto the output
Reduce the output power by a known ratio (important when dealing with fibre)
No monitoring data or other insights possible
Very failure resistant
Logical operation
There are different technologies for mirroring the payload when using copper, these
are generally resistor based, for fibre they’re either thin film or fused biconical taper based.
Note: Thin film is generally preferred for 40G+ links, due to their lower loss rate caused by more even light distribution.
The concept for all of them is the same:
Given an input of 100%
Bleed off x% of the signal to the mirror port
Pass the remaining 100-x% through to the output port
A key consideration when using fibre is the ‘split ratio’, aka how much light to bleed off;
both the monitor and the output interfaces need enough light for the optics on the other end.
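A quick sketch of the light-budget arithmetic (the insertion losses here are idealised; real taps add some excess loss on top):

package main

import (
	"fmt"
	"math"
)

// splitLossDB returns the idealised insertion loss of a tap leg that
// passes the given fraction of the input light.
func splitLossDB(fraction float64) float64 {
	return -10 * math.Log10(fraction)
}

func main() {
	// A common 70/30 split: ~1.5dB on the through path, ~5.2dB on the
	// monitor path; both legs must still leave enough light for the
	// receiving optics after patch and connector losses.
	fmt.Printf("through: %.1fdB, monitor: %.1fdB\n",
		splitLossDB(0.70), splitLossDB(0.30))
}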
Active tap
They require power and re-generate the output signals.
Monitoring data can be provided (light levels etc)
When using fibre there are no split ratio considerations
Logical operation
The internal complexity varies, but the principle is:
Given an input of X
Generate the payload of X into Y and Z
Transmit Y to the monitor interface
Transmit Z to the output interface
It is possible to buy active taps with the capability to ‘fail open’, meaning that in the event
of a power failure traffic will continue to flow.
I still prefer to use passive taps, which should be as resilient as a fibre patch panel.
Why do we want to aggregate it?
In the most simplistic deployment, we can simply send from the source to the destination:
However, there are a number of reasons to have an aggregation step in the middle:
Number of ingress points
Aggregation of smaller capture points into larger interfaces
Reduction in rack space required for analysers, servers etc
Strategic aggregation to reduce physical requirements (fibre, rack space)
Multiple destinations
Apply filtering logic to save on analyser licensing
Send traffic to security appliances and network monitoring devices
Apply software logic to capture rules
Support multiple media types
Provide longer reach for copper-based taps
SMF, MMF, XFP, QSFP, Copper support
Single view of traffic
Certain DPI/IDS appliances require full flows, difficult in ECMP networks
I don’t recommend this due to the associated scaling issues
Downsides to having an aggregation step:
Potential congestion issues (we’re taking raw traffic, limited by the sender)
Single point of failure
This could be mitigated using layer 1 fibre switches or similar
It’s for monitoring traffic, so uptime is likely not as critical
Cost
Historically TAP aggregation has been a very expensive game, around 2-4k a port!
Thankfully Arista changed that with their 7150 series switch, which supports a tap mode
as well as a ‘DANZ’ software suite for loss/latency monitoring, packet filtering and mirroring.
The DANZ suite is now available on the 7150, 7280E and 7500E series switches from Arista,
opening up options from 1/2U fixed form to 7/11U chassis-based deployments.
The 7150 series gives some nice features:
Port density; up to 64 10G ports
LANZ+ features for micro-burst analysis
PTP support for packet time stamping
~350ns latency for all packets
Multi-port mirroring
Hitless (ISSU) upgrades
A ‘normal’ Linux environment and Python based toolset
You configure it like any other switch!
Let’s implement a solution!
We’ll keep it simple with 1 aggregation point, 6 inputs and 2 outputs.
In reality, this could be multiple levels of aggregation.
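First, enable tap aggregation mode (from memory of the 7150; the exact sub-mode prompt may vary between EOS releases):
switch(config)#tap aggregation
switch(config-tapagg)#mode exclusive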
This will place all switch ports into error disabled and enable tap/tool ports.
Next, define our tap ports
switch(config)#interface ethernet 1
switch(config-if-Et1)#description Transit Provider 1
switch(config-if-Et1)#switchport mode tap
switch(config-if-Et1)#switchport tool group INTERNET
switch(config)#interface ethernet 2
switch(config-if-Et2)#description Transit Provider 2
switch(config-if-Et2)#switchport mode tap
switch(config-if-Et2)#switchport tool group INTERNET
switch(config)#interface ethernet 3
switch(config-if-Et3)#description Corporate Spine 1
switch(config-if-Et3)#switchport mode tap
switch(config-if-Et3)#switchport tool group CORPORATE
switch(config)#interface ethernet 4
switch(config-if-Et4)#description Corporate Spine 2
switch(config-if-Et4)#switchport mode tap
switch(config-if-Et4)#switchport tool group CORPORATE
switch(config)#interface ethernet 5
switch(config-if-Et5)#description Wan Provider 1
switch(config-if-Et5)#switchport mode tap
switch(config-if-Et5)#switchport tool group WAN
switch(config)#interface ethernet 6
switch(config-if-Et6)#description Wan Provider 2
switch(config-if-Et6)#switchport mode tap
switch(config-if-Et6)#switchport tool group WAN
We now have 3 groups with 2 interfaces in each.
Finally, map those groups onto our outputs.
switch(config)#interface ethernet 10
switch(config-if-Et10)#description Bro Network Security Monitor
switch(config-if-Et10)#switchport mode tool
switch(config-if-Et10)#switchport tool group set INTERNET WAN
switch(config)#interface ethernet 11
switch(config-if-Et11)#description Secret Server
switch(config-if-Et11)#switchport mode tool
switch(config-if-Et11)#switchport tool group set CORPORATE
Et10 + Et11 will now receive any traffic sent to their relevant groups.
Advanced features
In the demo we have a static input -> output allocation; in real life this can be
software-controlled or filter-based.
We could also truncate packets to look at only their headers,
potentially useful for encrypted traffic.
A simple traffic steering example is as below:
For traffic coming into Eth1
Match traffic targeting 8.8.8.8
Send to tap group GOOGLE_DNS
Send tap group GOOGLE_DNS to Eth20
switch(config)#ip access-list ACL_GOOGLE_DNS
switch(config-acl-ACL_GOOGLE_DNS)#permit ip any 8.8.8.8/32
switch(config)#class-map type tapagg match-any TAP_CLASS_MAP
switch(config-cmap-TAP_CLASS_MAP)#match ip access-group ACL_GOOGLE_DNS
switch(config)#policy-map type tapagg TAP_POLICY
switch(config-pmap-TAP_POLICY)#class TAP_CLASS_MAP
switch(config-pmap-TAP_POLICY-TAP_CLASS_MAP)#set aggregation-group GOOGLE_DNS
switch(config)#interface ethernet 20
switch(config-if-Et20)#description Magic Box
switch(config-if-Et20)#switchport mode tool
switch(config-if-Et20)#switchport tool group set GOOGLE_DNS
switch(config)#interface ethernet 1
switch(config-if-Et1)#service-policy type tapagg input TAP_POLICY
For software-based control, Arista provides a powerful HTTP API as well as
XMPP client support and ‘on-device’ APIs. The Python eAPI client can be found on
GitHub, with some examples.
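As an illustration of driving this from code, eAPI is plain JSON-RPC over HTTP(S); a hedged Go sketch, where the host, credentials and command list are assumptions and eAPI must already be enabled on the switch:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// runCmds is the standard eAPI JSON-RPC method; these commands
	// mirror the interactive traffic-steering config above.
	body, _ := json.Marshal(map[string]interface{}{
		"jsonrpc": "2.0",
		"method":  "runCmds",
		"params": map[string]interface{}{
			"version": 1,
			"cmds": []string{
				"enable",
				"configure",
				"interface ethernet 1",
				"service-policy type tapagg input TAP_POLICY",
			},
			"format": "json",
		},
		"id": "1",
	})
	req, err := http.NewRequest("POST", "https://switch/command-api", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("admin", "admin") // illustrative credentials
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}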
Summary
It is cost effective to deploy tap infrastructure where required.
The offering from Arista is very powerful, allowing flexibility to meet any number
of requirements.
With a number of integration options (JSON API, Python API, XMPP client),
having dynamic tapping capabilities is a nice spanner to have in your toolbox.
And the cost? Basically, about the same as a 10G switch with routing functionality.
Naturally, the cost varies depending on physical requirements (fibre/copper runs),
port speeds (1/10/25/40/100G) and port count; this is applicable to any switch or
other tap based deployment.
Footnote
I would advise any tap deployments to be carefully planned; certain topologies,
such as multi-stage Clos networks, do not make good tap targets, due both to the number
of links involved and the resulting bandwidth requirements for the TAP infrastructure.
Depending on your goals, tapping at natural points of congestion, such as
the ingress/egress points for your high performance/bandwidth network segments
(transit links, firewalls etc), will likely provide highly useful information at
vastly reduced capture complexity.
During the latest migration, the tool accounts were re-created from scratch. This post will outline how things are configured, as a point of reference for the future.
Overview
Accounts:
tools.cluebot - Legacy account - only a web service redirect is running
tools.cluebot3 - Dedicated account for ClueBot III
tools.cluebotng - Account for all things related to ClueBot NG
A manual setup is required to create the config file.
This should be created as ~/cluebot3/cluebot3.config.php under the tool account, containing the below:
<?php
$owner = 'Cobi';
$user = 'ClueBot III';
$pass = 'Clearly this is not the actual password';
$status = 'rw';
$maxlag = 2;
$maxlagkeepgoing = true;
In your local git clone, you can now do a full deploy (this starts the bot):
fab deploy
Things should be running, check the job status/logs under the tool account to confirm this.
tools.cluebot3@tools-bastion-03:~$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
xxxxxxx 0.30169 cluebot3 tools.cluebo r 03/04/2017 10:23:33 continuous@tools-exec-xxxx.eqi 1
tools.cluebot3@tools-bastion-03:~$ tail -f ~/logs/cluebot3-2017-03-05.log
cluebot3.INFO: doarchive(xxxxxxxxxxxx)[][]
You should also see edits from the bot after a while (a number of index pages need to be checked first).
Debugging
The main bot log is normally somewhat insightful. Sometimes the bot will die due to memory usage when processing large pages;
this generally isn’t seen in the logs and is hard to replicate running manually, so looking for restarts is a good indicator.
ClueBot NG
This bot is slightly more complicated and has numerous parts. The main part is fully scripted, but a number of (non-obvious) configs need to be in place.
Architecture
This is not a depiction of request flow, but service dependencies.
Critical services for basic bot functionality include:
Wikipedia API (For downloading changes + reverts)
Tools DB (For creating vandalism IDs + recording action)
Wikipedia DB Replicas (up to date) (For fetching extra metadata)
Wikipedia IRC RC Feed (For the change feed)
Main Bot (Processor)
Core (For edit scoring)
Setup
The server-side setup, assuming a clean account, can be done following the below (locally):
A number of config files need to be created manually.
~/.cluebotng.password.only
The Wikipedia ClueBot NG user password
~/apps/bot/bot/cluebot-ng.config.php
<?php
namespace CluebotNG;

class Config
{
    public static $user = 'ClueBot NG';
    public static $pass = null;
    public static $status = 'auto';
    public static $angry = false;
    public static $owner = 'Cobi';
    public static $friends = 'ClueBot,DASHBotAV';
    public static $mw_mysql_host = 'enwiki.labsdb';
    public static $mw_mysql_port = 3306;
    public static $mw_mysql_user = 's52585';
    public static $mw_mysql_pass = 'a password that is actually real';
    public static $mw_mysql_db = 'enwiki_p';
    public static $legacy_mysql_host = 'tools-db';
    public static $legacy_mysql_port = 3306;
    public static $legacy_mysql_user = 's52585';
    public static $legacy_mysql_pass = 'a password that is actually real';
    public static $legacy_mysql_db = 's52585__cb';
    public static $cb_mysql_host = 'tools-db';
    public static $cb_mysql_port = 3306;
    public static $cb_mysql_user = 's52585';
    public static $cb_mysql_pass = 'a password that is actually real';
    public static $cb_mysql_db = 's52585__cb';
    public static $udpport = 3334;
    public static $coreport = 3565;
    public static $fork = true;
    public static $dry = false;
    public static $sentry_url = null;
}
~/apps/report_interface/web-settings.php
<?php
$dbHost = 'tools-db';
$dbUser = 's52585';
$dbPass = 'a password that is actually real';
$dbSchema = 's52585__cb';
$rcport = 3333;
$recaptcha_pubkey = "something here";
$recaptcha_privkey = "something here too";
~/apps/bot/relay_irc/relay_irc.conf.js
exports.nick = 'CBNGRelay';
exports.server = 'irc.cluenet.org';
exports.extra = [
    'OPER antiflood This Is Not The One You Are Looking For',
];
Re-Deploy
Now the config files are in place, the bot should actually work.
In your local git clone, complete another deploy to restart everything:
fab deploy
Bot Checks
First, check the job status
tools.cluebotng@tools-bastion-03:~$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
xxxxxxx 0.30159 lighttpd-c tools.cluebo r 03/04/2017 12:48:57 webgrid-lighttpd@tools-webgrid 1
xxxxxxx 0.30154 cbng_relay tools.cluebo r 03/04/2017 13:32:09 continuous@tools-exec-xxxx.too 1
xxxxxxx 0.30154 cbng_core tools.cluebo r 03/04/2017 13:32:11 continuous@tools-exec-xxxx.too 1
xxxxxxx 0.30092 cbng_bot tools.cluebo r 03/04/2017 23:56:38 continuous@tools-exec-xxxx.too 1
The main bot log provides a good indicator as to the source of problems, but has limited data due to the logging volume.
It is common to see ‘Failed to get edit data for xxx’; this is only a problem if it’s happening for a large number of changes. Normally it is caused by delayed replicas, making the user/page metadata for new users/pages non-existent.
The relays generally don’t break but may have incorrect entries in the database. The simplest fix is to kill the job and let it re-spawn.
The report interface will likely break due to PHP being updated and will need fixing from time to time; there is a motivation to rebuild the interface to include the review functionality as well as OAuth-based authentication (T135323).