Simple Threat Analysis with Pandas¶
This Jupyter notebook is available at https://github.com/donadatum/Python_Projects
This notebook demonstrates a simple way to see if egress (outbound) transactions from wire data traffic are terminating a potential malicious sites. The Python software library pandas is used to compare the IP addresses of servers exfiltrating from a mock set of IP addresses from businesses against IP addresses of malicious sites.
To represent ‘known bad actors’, this example uses IP addresses that have been reported within the last 48 hours as having run attacks on the service Mail, Postfix at the Blocklist website: https://lists.blocklist.de/lists/mail.txt, downloaded Jan 9, 2018.
import pandas as pd
%matplotlib inline
Load the data into pandas using pd.read_csv. Our data uses latin-1 encoding, depending on your source you may need to specify another option, such as utf-8 or utf-16.
egress=pd.read_csv("Traffic.csv", encoding='latin-1')
Examining the first few records we can see that the Server Address column is the destination ip address for outbound (egress) transactions.
egress.head(5)
We can quickly identify the countries and volume of traffic associated with the egress servers using the column ‘Dest Country’, which stands for Destination Country.
Index into the ‘egress’ dataframe and select the column, this creates a pandas series object, assign it to variable ‘servers’.
Using the value_counts() method on the servers object gives us a count for each country. The US was the most frequent destination country (9580), followed by Norway (291) and Ireland (51).
servers=egress['Dest Country']
servers.value_counts()
We can plot the distribution, by using the .plot() method and specifying a horizontal bar chart. However in this example it is not is very useful because of the very high frequency of US IP addresses compared with other countries.
egress['Dest Country'].value_counts().plot(kind='barh')
What if we wanted to look at the traffic from a particular country, such as France? We could use indexing combined with the the pandas .isin() method. This is similar to the SQL LIKE operator. The produces a new dataframe ‘France’.
France=(egress.loc[egress['Dest Country'].isin(['France'])])
France.head(2)
type(France)
Using the describe method we can see that there are 13 transactions – this is the same number identified using .value_count() above. All of these transactions are from the Santa Clara Campus, the most frequent destination is Ale-de-France and 188.165.39.118 is the most common server address.
France.describe()
Server Addresses¶
Lets return to the full ‘egress’ dataframe and select all data in the ‘Server Address’ column, assigning it to the variable ‘server_ip’
server_ip=egress['Server Address']
Using the .describe() method we see that there are 10,000 records, 691 unique server addresses, the top (highest frequency) IP address is 52.70.108.65, which occurs 225 times.
server_ip.describe()
We can use .value_counts() to count the frequency of each IP address and .head(20) to limit the results to the top 20.
server_ip.value_counts().head(20)
This time the bar graph offers greater visualization power.
server_ip.value_counts().head(20).plot(kind='barh')
Right now server_ip is a pandas.Series object, we can use .type() to confirm that. If you want this data in dataframe format, use .to_frame() method to convert.
type(server_ip)
server_ip=server_ip.to_frame()
type(server_ip)
Using .tail() we can confirm that our new server_ip dataframe still has 10,000 records.
server_ip.tail(3)
The Bad Guys¶
The file ‘badips.csv’ contains a list of malicious IP addresses from the Blocklist website. We use pd.read_csv to bring the data into a pandas dataframe. We specify that there is no header and assign the column name ‘Server Address’ to our 1 column dataframe.
blocklist = pd.read_csv('badips.csv', header=None, names=['Server Address'])
The blocklist dataframe has 20,134 Server Addresses, all of them are unique as each address represents a different site.
blocklist.describe()
blocklist.tail(5)
To see how many of our egress transactions in the ‘egress’ dataframe are terminating at malicous sites from the ‘blocklist’ dataframe – we merge the two dataframes.
We use pd.merge and specify an inner join on the ‘Server Address’ column.
Congratulations! The resulting joined_ips dataframe is empty. There are no transactions terminating at a known malicious site.
joined_ips=pd.merge(egress,blocklist, on='Server Address', how='inner')
joined_ips.info()
joined_ips.tail(20)
Lets add some fake data (like fake news but better!) into our blocklist to make sure the join is working. I have taken 10 IP addresses from the ‘egress’ data frame and created a file ‘fakebadservers.csv’. After loading this file using pd.read_csv I concatonated this dataframe to ‘blocklist’ to create ‘fakeblocklist’.
fakeblocklist=pd.read_csv('fakebadservers.csv', names=['Server Address'])
fakeblocklist=pd.concat([blocklist, fakeblocklist], axis=0)
fakeblocklist.info()
fakeblocklist.head(3)
Lets try the merge again, only this time using ‘egress’ and ‘fakeblocklist’. This merge identifies 580 transactions associated with a known ‘bad actor’ – that is the Server Address in egress matches the server address in ‘fakeblocklist’.
joined_ips2=pd.merge(egress,fakeblocklist, on='Server Address', how='inner')
joined_ips2.head(5)
joined_ips2.info()
Conclusion¶
This is just a simple example, but I think pandas offers great potential for threat detection. Next step – API integration.