System Design of an Incident Response Platform

Introduction

In this article at OpenGenus, we'll discuss how to go about creating an Incident Response Platform from the ground up, explaining each concept along the way.

An Incident Response Platform, or IRP, is a system for managing security incidents. It suits small and large organizations alike, since one of its main use cases is to detect such incidents when they happen and alert everyone involved, which prompts communication among team members.

It covers everything to do with incidents that may occur, such as detecting, classifying and resolving them, and it also collects information that may help the organization keep future threats away.

All in all, an incident response platform is worth having in an organization, since it helps security teams respond to threats efficiently, spend less time on already-documented problems and more time on harder challenges, thus reducing the attack surface and minimizing the overall impact of threats.

Now that we understand the benefits of such a platform, let's see how one can be built.

Incident Response Platform

Functionality

Let's talk about functionality first. We have briefly mentioned it before, but let's break it down into specific steps now:

  • Select monitoring targets
  • Detect incident
  • Look into incident
    • Classify incident
  • Respond to incident
    • Contain incident
    • Eliminate incident
  • Recover from incident

High-level design

It's time to take each function and consider how it might work within our platform.

Select monitoring targets

First of all, our product needs to let users make their own decisions about what they want to protect within their systems. There may be certain data they don't want us to see or just aren't that concerned with protecting.

Regardless of the reason, we should let users select what it is they want to protect, and then perhaps save those into our own database.

If we wish to be really accommodating, we can research what exactly users need, so that we can offer predefined options for them to choose from. For example, some monitoring targets might be:
* Network infrastructure
* Database
* Application server
* Data and information assets

We could also add possible threats for each one of them as we go, which will definitely help us in detecting incidents when they actually show up. For our examples above, some threats might be:
* Malicious software (viruses etc)
* SQL Injections
* Application Layer Attacks
* Data breaches
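
As a rough sketch of how these predefined options could be stored, the snippet below maps each monitoring target to the example threat listed for it above. The names `PREDEFINED_TARGETS` and `select_targets` are hypothetical and chosen only for illustration; a real platform would likely keep this catalogue in its database.

```python
# Hypothetical catalogue: monitoring target -> example threats from the lists above.
PREDEFINED_TARGETS = {
    "Network infrastructure": ["Malicious software"],
    "Database": ["SQL injection"],
    "Application server": ["Application layer attack"],
    "Data and information assets": ["Data breach"],
}


def select_targets(chosen):
    """Return the threats to watch for, keeping only targets the user opted into."""
    unknown = [name for name in chosen if name not in PREDEFINED_TARGETS]
    if unknown:
        raise ValueError(f"Unknown monitoring targets: {unknown}")
    return {name: PREDEFINED_TARGETS[name] for name in chosen}


selection = select_targets(["Database", "Application server"])
print(selection)
```

Anything the user does not select is simply left out of the result, which matches the idea that users decide what we may monitor.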

Detect incident

Now, once we have established what we need to protect and some threats that may occur to each one, we actually need to set up some functionality to detect the latter.

In order to detect these incidents, we're going to need patterns that could indicate threats. Our application will require agents, possibly at every end-point of our monitored targets.

These agents will use the established patterns to detect incidents. For example, they will be able to monitor specific malware behaviours, such as injections; suspicious IP addresses; failed login attempts; and many other factors that may point to a threat.

Any time an agent comes across anything suspicious, it will generate an alert and report it to a Detection service.

Example of how this may work:
Let's say a user tries to access system files that are not available to the public. Such directories will usually be monitored by the IRP, and there will be specific criteria within the app to detect this. An alert will be generated, and the agent will send it to the Detection service, which will add it to a queue that escalates the event to investigation.
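
The flow in this example could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Event` and `Agent` classes, the restricted path prefixes, the failed-login threshold, and the use of an in-memory `deque` as a stand-in for the Detection service's queue. The IP addresses come from ranges reserved for documentation.

```python
from collections import deque
from dataclasses import dataclass

# Assumed rule values, for illustration only.
RESTRICTED_PREFIXES = ("/etc/", "/var/private/")
MAX_FAILED_LOGINS = 5


@dataclass
class Event:
    source_ip: str
    kind: str          # "file_access" or "login_failed"
    detail: str = ""   # e.g. the path that was accessed


class Agent:
    def __init__(self, detection_queue):
        self.detection_queue = detection_queue
        self.failed_logins = {}  # source_ip -> number of failed logins seen

    def observe(self, event):
        """Check one event against the patterns; enqueue an alert on a match."""
        reason = None
        if event.kind == "file_access" and event.detail.startswith(RESTRICTED_PREFIXES):
            reason = f"access to restricted path {event.detail}"
        elif event.kind == "login_failed":
            count = self.failed_logins.get(event.source_ip, 0) + 1
            self.failed_logins[event.source_ip] = count
            if count >= MAX_FAILED_LOGINS:
                reason = f"{count} failed logins"
        if reason is not None:
            self.detection_queue.append({"source_ip": event.source_ip, "reason": reason})


detection_queue = deque()  # stands in for the Detection service's queue
agent = Agent(detection_queue)
agent.observe(Event("203.0.113.7", "file_access", "/etc/shadow"))
agent.observe(Event("198.51.100.2", "file_access", "/public/index.html"))
print(list(detection_queue))
```

Only the first event raises an alert, since the second one touches a public path.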

Look into incident

Now that we've identified some incidents, they will be stored in the queue, waiting to be investigated.

We are going to use specific workers for the investigation. Each worker will take the first element in the queue of detected threats and start by trying to classify its severity.

Classify incident

We have enumerated several possible threats before. We should have a priority set for each of them, which we can combine with other information, such as the importance of the data involved and the level of risk, in order to compute the final priority.

Let's say our app will classify incidents as first-priority, second-priority, or third-priority, with first being the most severe, second being moderate, and third being borderline harmless.

Once a worker is done classifying, it will store the incident's priority in the database.

The incidents will go on to an Investigation service, which will sort them into 3 queues based on their priority; this will aid us greatly when responding to them.
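
One possible way to combine these factors is sketched below. The scoring rule (adding three ratings and cutting the sum at fixed thresholds) is purely an assumption for this example, as are the names `compute_priority`, `classify` and `priority_queues`; the article does not prescribe a formula.

```python
from collections import deque

# One queue per priority level: 1 = most severe, 3 = borderline harmless.
priority_queues = {1: deque(), 2: deque(), 3: deque()}


def compute_priority(threat_priority, data_importance, risk_level):
    """Combine three ratings (each 1 = most severe .. 3 = least) into a final priority.

    The thresholds are an assumption made for this sketch.
    """
    score = threat_priority + data_importance + risk_level  # ranges from 3 to 9
    if score <= 4:
        return 1
    if score <= 6:
        return 2
    return 3


def classify(incident):
    """Assign a priority to the incident and place it in the matching queue."""
    incident["priority"] = compute_priority(
        incident["threat_priority"], incident["data_importance"], incident["risk_level"]
    )
    priority_queues[incident["priority"]].append(incident)
    return incident["priority"]


print(classify({"id": 1, "threat_priority": 1, "data_importance": 1, "risk_level": 2}))  # 1
print(classify({"id": 2, "threat_priority": 3, "data_importance": 3, "risk_level": 2}))  # 3
```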

Respond to incident

We're going to need several servers here as well, just like we had agents and workers for the other two functionalities. We'll call them working servers here, so as to avoid confusion.

We've discussed how the Investigation Service will send the data to this next step in 3 different queues, based on priority.

The first instinct here may be to assign all working servers to responding to the most severe incidents. And, sure, that may be an approach. But if we have a lot of incidents, we may never get to the less severe ones.

We could, instead, direct 1/2 (3/6) of the working servers to the first-priority incidents, 1/3 (2/6) to the moderate ones, and the remaining 1/6 to the third-priority ones.
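
The split can be computed as in the sketch below. The function name `allocate` and the choice to hand any servers left over from rounding down to the first-priority pool are assumptions made for this example.

```python
from fractions import Fraction

# Share of working servers per priority, as proposed above.
SHARES = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}


def allocate(total_servers):
    """Split servers by share; leftovers from rounding down go to priority 1."""
    allocation = {p: int(total_servers * share) for p, share in SHARES.items()}
    allocation[1] += total_servers - sum(allocation.values())
    return allocation


print(allocate(6))    # {1: 3, 2: 2, 3: 1}
print(allocate(100))  # {1: 51, 2: 33, 3: 16}
```

Using exact fractions avoids floating-point surprises when the number of servers is not a multiple of 6.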

Working servers
How will working servers function? We've established before that they need to do two things: contain and eliminate incidents.

Containing incident

Here, we'll need to have some patterns set up in order to contain incidents. Let's go through some examples of how specific situations could be contained.

  • Isolate -> isolating the affected systems. To do this, we could use network segmentation.
  • Disable accounts -> to do this, we could either reset the account password or suspend the account.
  • Block malicious traffic -> for this, external services would be easier to use, such as Intrusion Detection Systems or Web Application Firewalls.
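
A working server could pick one of these containment patterns with a simple lookup table, as in the sketch below. The incident types, the function names and the fallback of escalating unknown types to a human analyst are all assumptions for illustration; the functions only describe the action instead of performing it.

```python
def isolate_system(incident):
    return f"isolate host {incident['host']} in its own network segment"


def disable_account(incident):
    return f"suspend account {incident['account']}"


def block_traffic(incident):
    return f"block traffic from {incident['source_ip']}"


# Hypothetical mapping from incident type to containment pattern.
CONTAINMENT_PLAYBOOK = {
    "malware": isolate_system,
    "compromised_account": disable_account,
    "malicious_traffic": block_traffic,
}


def contain(incident):
    """Look up the containment pattern for the incident type and describe the action."""
    action = CONTAINMENT_PLAYBOOK.get(incident["type"])
    if action is None:
        return "escalate to a human analyst"  # assumed fallback for unknown types
    return action(incident)


print(contain({"type": "compromised_account", "account": "alice"}))
```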

Eliminating incident

Once the incident is contained and, thus, no longer a pressing issue, the working servers can finally eliminate it.

There will also be certain strategies set in place here. For instance:

  • Removing malware altogether
    • Identify the type of malware and its indicators of compromise
    • Remove the malware -> This can be done using antivirus software

Recover from incident

Now, the last step is recovering from the incident. Here, based on the incident that occurred, we could employ a number of strategies.

For example, if the incident was a data breach, we could use dedicated data backup software in order to get our files, folders, images, or databases back.

Also, for such problems, our platform should 'learn' to protect products better, possibly by storing local backups of our data on network-attached storage or by setting up a backup schedule, in order to ensure that data is always intact.

Moreover, in order to increase security even further, our platform could also review the security settings of the given product, thus strengthening its resilience against further cybersecurity attacks. Furthermore, security policies could be added.

Database solution

Great! Now that we've thoroughly thought out each function the platform needs to have, we can do something more hands-on, like creating a database solution.

First of all, we're going to need to store each organization we're providing our services to. Let's call this Product for the sake of simplicity. Our Product object should have an id of its own and, possibly, a name, and any other relevant information.

The second object will be called MonitoredPart. This refers to any part of the application that is selected as a monitored target, and it is from here that we know at which endpoints agents should be set up. This object needs to contain an id, a foreign key id to point to its Product, and any other fields that may be needed, such as a name.

We'll also need a Threat object. This one will store common threats, which will aid our agents in detecting incidents faster, but also help our workers to classify those incidents more accurately. This will only require an id, a foreign key to the monitored part it targets, some kind of a name, and an estimated priority. Feel free to add any other fields you feel are necessary.

Now, for a concept our application wouldn't even exist without, and which can even be found in its name: we need an Incident model. This one should contain a unique id, the id of the monitored part where it occurred, the id of the threat it corresponds to, if any, and the computed priority.

Lastly, we mentioned having three types of servers to do several kinds of work: agents, workers, and working servers. In reality, they're all servers, but are just set up to do different computations. We'll also need to store them in the database.

For the Agent model, we're going to need a unique identifier, some kind of a status to figure out whether it should take on another task or not, and a foreign key to the MonitoredPart it was assigned to in the beginning.

The Worker model requires an id, a status field, and the foreign key id of the Incident it's investigating.

Finally, since working servers require the same information as workers, we could just add a type field to the Worker model in order to avoid redundancy. Thus, when they are accessed, they can be filtered with this type field.
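
The models described above could be written down as a relational schema along these lines. This is only a sketch using Python's built-in `sqlite3` module with an in-memory database; the column names, types and the allowed values of the `type` field are assumptions, since the article only names the fields loosely.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Product (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE MonitoredPart (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES Product(id),
    name TEXT NOT NULL
);
CREATE TABLE Threat (
    id INTEGER PRIMARY KEY,
    monitored_part_id INTEGER NOT NULL REFERENCES MonitoredPart(id),
    name TEXT NOT NULL,
    estimated_priority INTEGER NOT NULL
);
CREATE TABLE Incident (
    id INTEGER PRIMARY KEY,
    monitored_part_id INTEGER NOT NULL REFERENCES MonitoredPart(id),
    threat_id INTEGER REFERENCES Threat(id),  -- NULL when no known threat matches
    priority INTEGER
);
CREATE TABLE Agent (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    monitored_part_id INTEGER NOT NULL REFERENCES MonitoredPart(id)
);
CREATE TABLE Worker (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    type TEXT NOT NULL CHECK (type IN ('worker', 'working_server')),
    incident_id INTEGER REFERENCES Incident(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```

The `type` column on `Worker` is what lets workers and working servers share one table, as suggested above.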

Here is a visual representation of the database solution:

Capacity estimates

Let's assume our platform has a large user base and processes a large number of events.
A reasonable estimate is about 100 products using our services at any given time.

Let's say that each product has an event rate of about 1 event per second.

100 products * 1 event / second = 100 events / second overall

100 EPS * 60 seconds/minute * 60 minutes/hour * 24 hours/day = 8,640,000 events / day

Explanation:
Calculating what 100 events per second means in terms of one day.

Storage

8,640,000 events * 1 KB = 8,640,000 KB ≈ 8.64 GB per day

Explanation:
If we assume that each event takes up about 1 KB of data, then we'll require about 8.64 GB of storage per day.
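
The event and storage estimates can be checked with a few lines of arithmetic. The constant names are made up for this sketch, and the conversion assumes decimal units (1 GB = 1,000,000 KB).

```python
PRODUCTS = 100
EVENTS_PER_SECOND_PER_PRODUCT = 1
EVENT_SIZE_KB = 1
SECONDS_PER_DAY = 60 * 60 * 24

events_per_second = PRODUCTS * EVENTS_PER_SECOND_PER_PRODUCT
events_per_day = events_per_second * SECONDS_PER_DAY
# Decimal units: 1 GB = 1,000,000 KB.
storage_gb_per_day = events_per_day * EVENT_SIZE_KB / 1_000_000

print(events_per_day)      # 8640000
print(storage_gb_per_day)  # 8.64
```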

Number of servers needed

Throughout the article, we've talked about several servers, to which we've referred as agents, workers, and working servers. Let's try to find how many of those we need for 8.64 million events daily.

Agents:

10 events/second * 86,400 seconds in a day = 864,000 events per agent per day
8,640,000 / 864,000 events per agent = 10 agents

Explanation:
The agents are going to check every one of the 8.64 million events. Assuming that each event takes about 100 ms (0.1 seconds) to check, one agent can check 10 events per second, or 864,000 events in a day. Thus, in order to check all 8.64 million events, we'll need 10 agents.

Workers:
We used workers for investigating events, but what we have to take into account is the fact that some events don't end up being detected as malicious. Let's assume only 1/30 of the events are actually malicious.

8,640,000 overall events / 30 = 288,000 malicious events
86,400 seconds in a day / ~30 seconds per investigation = 2,880 checks per worker
288,000 / 2,880 = 100 workers

Explanation:
If only 1 in 30 events is malicious, then we can assume that only 288,000 end up being detected.
If there are 86,400 seconds in a day and it takes roughly 30 seconds to do an investigation, that means that each worker can perform 2,880 checks daily.
If we need 288,000 checks done daily and each worker does 2,880, that means we need 100 workers for proper functionality.

Working servers:
We used working servers for responding to incidents. Response times can vary just as much as investigation times, so we can fairly assume that this computation's result would be the same as the workers'. This would put us at 100 working servers.

Results:
Adding all of those servers up: 10 + 100 + 100 = 210 servers
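
The server counts can be recomputed directly from the stated assumptions: 100 ms per agent check, 1 in 30 events being malicious, and about 30 seconds per investigation or response. The helper `servers_needed` is a hypothetical name, and the sketch works in whole milliseconds so that the division is exact.

```python
MS_PER_DAY = 86_400 * 1000
EVENTS_PER_DAY = 8_640_000


def servers_needed(items_per_day, ms_per_item):
    """Total processing time per day divided by one server's day, rounded up."""
    total_ms = items_per_day * ms_per_item
    return -(-total_ms // MS_PER_DAY)  # integer ceiling division


agents = servers_needed(EVENTS_PER_DAY, 100)         # 100 ms per check
malicious_per_day = EVENTS_PER_DAY // 30             # 1 in 30 events is malicious
workers = servers_needed(malicious_per_day, 30_000)  # ~30 s per investigation
working_servers = workers                            # assumed same load as workers

print(agents, workers, working_servers, agents + workers + working_servers)
# 10 100 100 210
```

With these inputs, 8,640,000 events at 864,000 events per agent per day comes out to 10 agents.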

Concepts and technologies

Network segmentation

This is a strategy that helps contain security breaches by dividing the network into segments, so that devices within one segment have no access to the others. Thereby, an attacker who gains access to a single network segment can only do limited damage.

In order to implement this, we can employ technologies such as:

  • VLANs
  • Firewalls - enforce access policies and filter traffic based on predetermined rules
  • Access Control Lists - filter traffic at router level

Intrusion Detection Systems

An intrusion detection system is an external service that can be used to identify incidents or, in the case for which we mentioned it in this article, to respond to them. Such systems can take actions in response to threats, such as blocking malicious actions or isolating breached systems.

Some popular intrusion detection systems include, but are not limited to:

  • Snort
  • Suricata
  • Zeek

Web Application Firewalls

Web Application Firewalls are commonly used to protect websites from a wide range of cybersecurity threats, such as data breaches.

They filter traffic, analyzing requests in order to block malicious ones, such as SQL injection attempts, and they use sets of rules to identify attack patterns.

There is a wide range of web application firewalls, some of them being:

  • ModSecurity
  • Imperva WAF
  • Amazon Web Services WAF
  • Barracuda WAF

Indicators of compromise

Indicators of compromise aren't actually technologies, unlike the other concepts mentioned, but rather pieces of information that indicate a security incident. They can be used in our incident response platform in order to identify threats.

Some examples of indicators are:

  • File indicators -> such as file names
  • Network indicators -> such as suspicious IP addresses or URLs
  • Behavioural Indicators -> suspicious activities such as a multitude of failed login attempts
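
Checking an observation against the three kinds of indicators might look like the sketch below. The indicator values, the threshold and the function name `match_indicators` are invented for illustration; real deployments would load indicators from a maintained feed.

```python
# Hypothetical indicator values, for illustration only.
INDICATORS = {
    "file_names": {"invoice.pdf.exe"},
    "ip_addresses": {"203.0.113.7"},
    "max_failed_logins": 5,
}


def match_indicators(observation):
    """Return the list of indicator types that the observation matches."""
    matches = []
    if observation.get("file_name") in INDICATORS["file_names"]:
        matches.append("file")
    if observation.get("source_ip") in INDICATORS["ip_addresses"]:
        matches.append("network")
    if observation.get("failed_logins", 0) >= INDICATORS["max_failed_logins"]:
        matches.append("behavioural")
    return matches


print(match_indicators({"source_ip": "203.0.113.7", "failed_logins": 7}))
# ['network', 'behavioural']
```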

Data Backup Software

Data backup software is designed to protect against data loss, allowing organizations to recover data in the event of cybersecurity incidents.

Thus, in the event of a system failure or a major data loss, backup software can be of great assistance in the recovery process.

Here are some examples of data backup software:

  • Duplicati
  • Bacula
  • Amanda

Network-attached Storage

Network-attached storage (NAS) is a storage device that provides centralized data storage over a network. Such a device is great for performing automated data backups, which helps ensure data integrity and availability.

Some popular network-attached storage devices are:

  • Synology NAS
  • QNAP NAS
  • Asustor NAS

Security policies

These, too, are pieces of information rather than technologies, but important nonetheless.

Security policies play a significant role in safeguarding sensitive data and ensuring that applications can effectively respond to and recover from incidents.

Some of the most important security policies that could be set up are as follows:

  • Deny Unauthorized Services by Default - denying traffic for services that are not needed. For example, blocking outbound traffic to port 25 (SMTP) helps prevent compromised machines from sending email spam.
  • Using segmentation rules - this implies compartmentalizing the network, which we've already covered
  • Establish VPN access rules - defining rules to allow VPN connections
  • Set up Geo-Blocking Rules - implementing rules to deter traffic from specific locations that do not align with the client's operations or are specifically known for malicious activities.
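
A deny-by-default policy with a blocklist can be sketched with Python's standard `ipaddress` module. The networks, ports and the `is_allowed` function below are assumptions for illustration; in particular, the blocked range is a documentation range standing in for a geo-blocked region.

```python
import ipaddress

# Hypothetical allow rules: (source network, destination port).
ALLOW_RULES = [
    (ipaddress.ip_network("10.0.0.0/8"), 443),   # internal clients to HTTPS
    (ipaddress.ip_network("10.8.0.0/24"), 22),   # VPN subnet to SSH
]
# Stand-in for a geo-blocked address range.
BLOCKED_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def is_allowed(source_ip, dest_port):
    """Deny by default: traffic passes only if not blocked and an allow rule matches."""
    source = ipaddress.ip_address(source_ip)
    if any(source in net for net in BLOCKED_NETWORKS):
        return False
    return any(source in net and dest_port == port for net, port in ALLOW_RULES)


print(is_allowed("10.1.2.3", 443))      # True
print(is_allowed("10.1.2.3", 25))       # False: no rule allows port 25
print(is_allowed("198.51.100.9", 443))  # False: source is in a blocked range
print(is_allowed("10.8.0.5", 22))       # True: VPN subnet may use SSH
```

Because nothing is allowed unless a rule says so, forgetting a rule fails closed instead of open.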
