Machine learning (ML) is taking cybersecurity by storm, just as it is other tech fields. Over the past year, much has been written about the use of machine learning in both defense and attack. While defense has been covered in most articles (I recommend reading "The Truth about Machine Learning in Cybersecurity"), machine learning for cybercriminals remains overshadowed, and there is no consensus on it.
Nonetheless, the U.S. intelligence community is concerned about the use of artificial intelligence. Recent findings show how cybercriminals can deploy machine learning to make attacks better, faster, and much cheaper to perform.
The objective of this article is to systematize information on possible and real-life ways machine learning can be deployed for malicious purposes in cyberspace. It is intended to help information security teams prepare for imminent threats.
All cybercriminal tasks that machine learning can aid, from initial information gathering to system compromise, can be categorized into several groups:
- Information gathering - preparing for an attack.
- Impersonation - attempting to imitate a confidant.
- Unauthorized access - bypassing restrictions to gain access to some resources or user accounts.
- Attack - performing an actual attack such as malware or DDoS.
- Automation - automating exploitation and post-exploitation.
Machine Learning for Information Gathering
Information gathering is the first step of every cyberattack, whether it is a targeted attack or one on multiple victims. The better you collect information, the better your prospects of success.
When preparing phishing or an infection, hackers may use classification algorithms to assign a potential victim to an appropriate group. Imagine that, after having collected thousands of emails, you send malware only to those who are more likely to click on the link because they see it as unsuspicious, thus reducing the chance that a security team gets involved. A number of factors may help here. As a simple example, you may separate users who write about IT topics on their social networking sites from those focused on food and cats. As an attacker, I would choose the latter group. Various clustering and classification methods, from K-means and random forests to neural networks, can be used.
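To make the grouping idea concrete, here is a minimal K-means sketch in plain Python. Everything in it is an assumption made for illustration: the six toy "posts" are already reduced to keywords, the function names are hypothetical, and a real pipeline would more likely use a library such as scikit-learn with far richer features.

```python
import math

# Toy posts, already reduced to keywords (hypothetical data).
posts = [
    "linux kernel firewall patch",
    "pasta cheese dinner recipe",
    "cat cheese naps dinner",
    "firewall linux server patch",
    "pasta dinner cat naps",
    "kernel server firewall linux",
]

# Bag-of-words vectors over the sorted vocabulary.
vocab = sorted({word for post in posts for word in post.split()})
vectors = [[post.split().count(word) for word in vocab] for post in posts]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            for p in points]

def kmeans(points, k, iterations=10):
    # Deterministic farthest-first initialization (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids)

labels = kmeans(vectors, k=2)
for post, label in zip(posts, labels):
    print(label, post)
```

On this toy data the IT-themed posts and the food-and-cats posts end up in separate clusters. Defenders can run the same kind of grouping to see which users look like the easiest targets and prioritize awareness training accordingly.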
For targeted attacks, there is just one victim with a complex infrastructure, and the mission is to collect as much information about that infrastructure as possible. The idea is to automate all the obvious checks, including information gathering about the network. While existing tools such as network scanners and sniffers can analyze traditional networks, the new generation of networks based on software-defined networking (SDN) is too complicated for them. That's where machine learning can assist adversaries. A little-known but interesting concept here is the Know Your Enemy (KYE) attack, which allows stealthy intelligence gathering about the configuration of a target SDN network; it is a relevant example of applying machine learning to the information gathering task. The information a hacker can collect ranges from the configuration of security tools and network virtualization parameters to general network policies such as QoS. By analyzing the conditions under which a rule is pushed from one network device into the network, and the type of that rule, an attacker can infer sensitive information about the network's configuration.
During the probing phase, the attacker makes a number of attempts to trigger the installation of flow rules on a particular switch. The specific characteristics of the probing traffic depend on the information the attacker is after.
In the next phase, the attacker analyzes the correlation between the probing traffic generated during the probing phase and the corresponding flow rules that are installed. From this analysis, he or she can infer which network policy is enforced for specific types of network flows. For instance, by using a network scanning tool in the probing phase, the attacker can figure out that the defense policy is implemented by filtering network traffic. Done manually, collecting the data can take weeks, and you still need algorithms with preconfigured parameters; for example, it is difficult to determine how many packets of a particular kind are needed to make a decision, because the number depends on various factors. With the help of machine learning, hackers can automate this process.
Those are two examples but, generally, any information gathering task that requires a great deal of time can be automated as well. For example, DirBuster, a tool that scans for available directories and files, could be improved by adding genetic algorithms, LSTMs, or GANs to generate directory names that are more similar to existing ones.
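The "how many packets are needed" question can be pictured as learning a threshold from observations instead of hard-coding it. The sketch below is only a toy illustration under stated assumptions: the observations are synthetic, the function name is hypothetical, and a one-feature decision stump stands in for whatever model is actually used; it is not the method from the KYE research.

```python
# Synthetic observations (hypothetical): number of probe packets sent,
# and whether a flow rule was seen to be installed afterwards.
observations = [
    (1, False), (2, False), (3, False), (4, False),
    (5, True), (6, True), (8, True), (13, True),
]

def fit_threshold(observations):
    """Fit a decision stump: predict 'rule installed' iff packets >= t.

    Returns the smallest candidate t with the fewest misclassifications.
    """
    candidates = sorted({count for count, _ in observations})
    candidates.append(candidates[-1] + 1)  # allows "never installed"

    def errors(t):
        return sum((count >= t) != installed for count, installed in observations)

    return min(candidates, key=errors)

threshold = fit_threshold(observations)
print("estimated packets needed to trigger a rule:", threshold)
```

With real traffic the observations would be noisy and multi-dimensional (packet type, rate, timing), which is why a learned model is more practical than preconfigured parameters. The same analysis is useful to defenders who want to know how much their own network reveals to a prober.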
Machine Learning for Impersonation
Cybercriminals use impersonation to attack victims in various ways, depending on the communication channel and the goal. Attackers can convince victims to follow a link to an exploit or malware by sending an email or using social engineering. Even a phone call can therefore be a means of impersonation.
Email spam is one of the oldest areas of security where machine learning has been used, and I expect it to be one of the first areas where cybercriminals apply ML. Instead of writing spam text manually, they can "teach" a neural network to create spam that looks like a real email.
However, when dealing with email spam, it is hard to behave like the person you are imitating. If you email employees on behalf of a company's administrator and ask them to change their passwords or download an update, you will not manage to write it exactly the way the administrator would. You won't be able to copy the style unless you have seen a pile of his or her emails. This issue, however, can be solved by social network phishing.
The biggest advantage of social media phishing over email phishing is openness and ease of access to personal information. You can watch and learn a user's behavior by reading his or her posts. This idea was proved in recent research called Weaponizing Data Science for Social Engineering - Automated E2E spear phishing on Twitter. The research presented SNAP_R, an automated tool that can significantly increase the effectiveness of phishing campaigns. While traditional automated phishing has a 5-14% success rate and manually targeted spear phishing reaches 45%, their method sits right in the middle at 30%, and up to 66% in some cases, with the same effort as an automated campaign. The authors used a Markov model to generate tweets based on a user's previous tweets and compared the results with a recurrent neural network, specifically an LSTM. The LSTM provides higher accuracy but takes more time to train.
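The Markov-model idea mentioned above can be illustrated with a toy word-level chain. This is a sketch under assumptions, not SNAP_R's implementation: the sample posts, the function names, and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical sample of a user's earlier posts.
posts = [
    "great talk on deep learning today",
    "deep learning models need more data",
    "reading about data science today",
]

def build_chain(posts):
    """Map each word to the list of words that followed it in the posts."""
    chain = defaultdict(list)
    for post in posts:
        words = post.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, start, max_words=8, seed=0):
    """Walk the chain from `start`, picking a random successor each step."""
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    words = [start]
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)

chain = build_chain(posts)
sentence = generate(chain, "deep")
print(sentence)
```

Every generated word pair has appeared in the user's own posts, which is why the output tends to echo their vocabulary. An LSTM learns longer-range structure, at the cost of the extra training time noted above.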
In the new era of AI, companies create not only fake text but also fake voices and videos. Lyrebird, a startup specializing in media and video, demonstrated a bot that can mimic voices and speak exactly like you. With growing amounts of data and evolving networks, hackers can achieve even better results. Since we don't know how Lyrebird works, and hackers probably can't use this service for their own needs, they may turn to more open platforms such as Google's WaveNet, which can do the same things.
Such systems rely on deep generative neural networks; generative adversarial networks (GANs), a more advanced type of neural network, are one approach used to produce fake media.
Tune in tomorrow when we'll cover how bad guys could potentially use machine learning to gain unauthorized access and to perpetrate attacks.