Read and write operations for memory-heavy cyber security data using MMAP

I really enjoy living in a woodland area. You can walk for hours without meeting anyone, which helps clear your mind of the constant stream of thoughts and pictures it produces. Sometimes you just need to stop, focus on priorities and push the other thoughts into your brain’s long-term memory storage. That is what we do. And we can even teach our programs that ability as well.

In the cyber security and artificial intelligence area, one of the most common tasks is to store filtered data for future analysis. For instance, when you study the behavior of users within your system, you not only need long series of monitoring data (user activity, failed logins, permission violations and so on), but you also have to analyze the data in real time, with a maximum delay of a couple of seconds from the moment the data first appear. To put it simply, you need to be fast.

In order to do so, your program has to find a way to quickly retrieve data from long-term memory, analyze them in real time, and drop them once they are no longer a priority (no security incident was detected as the result of the analysis). You can start by keeping the data in the limited, short-term, dynamically allocated memory, but they will be lost once the program is restarted or killed.

Of course, the next step could be file persistence, where you periodically store the data in long-term memory (the filesystem). However, what if the program stops right at the moment a security incident is being evaluated? The periodic store would have no time to persist the data, so you would lose the result of the analysis. And any additional orchestration, such as keeping only the relevant data in the limited dynamically allocated memory, would be way too slow.

There is, however, a native and efficient approach that keeps the data synchronized between memory and files (thus avoiding data loss on restart), while using the limited short-term memory only for the data that are actually being used, prioritized and analyzed. This native functionality is called swap. What does that mean? Swapping happens when the applications running on the operating system consume all the available RAM. The operating system then moves some of the less used memory pages to a swap file. Hence the memory available to the applications, the so-called virtual memory, is actually larger than the physical RAM space. There are obviously performance costs connected with swapping, and the same applies to MMAP, which is why heavy swapping should not happen very often.

If we take this swapping mechanism and additionally instruct it to mirror everything from the allocated memory to a file, we get exactly the functionality we need: a fast, low-level way to persistently store large data while keeping them instantly available for real-time analysis. On UNIX systems, this “instruction” is quite easy to give, via the memory map (MMAP) function. The signature of the MMAP function is straightforward:
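#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);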

In our case, we are not interested in a specific location in the available memory space where the mapping should take place, so the first argument, addr, can be NULL. It is then up to the kernel to choose the place for the memory mapping. The second argument, length, specifies the amount of memory we want to use. And here comes the trick: the length of the mapped memory is the same as the size of the file we want the mirroring to synchronize the data into. So the first step is to create a file whose size is a multiple of the memory page size (usually 4096 bytes), and then use the size of that file as the length of the mapping.
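For instance, the page size can be queried at runtime and the backing file sized as an exact multiple of it (a small sketch; the file name and the factor of 1024 pages are the ones used in the sample further below):

long page_size = sysconf(_SC_PAGESIZE);        /* usually 4096 bytes */
size_t length = (size_t)page_size * 1024;      /* used both as file size and mapping length */

int fd = open("mymemoryfile.bin", O_RDWR | O_CREAT, 0644);
fallocate(fd, 0, 0, (off_t)length);            /* extend the file to the mapping length */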

The third argument, prot, must allow both read and write operations on the mapped memory, so in our case it is set to PROT_READ | PROT_WRITE. The most interesting argument is the fourth one, flags, which instructs MMAP to use the swapping/mirroring ability. To do so, set flags to MAP_SHARED, which means that all changes in the memory are also written to the file and, on top of that, are visible to other processes (here comes the keyword: shared), which is especially useful for parallel tasks.

The fifth argument, fd, is a file descriptor of the open file we want the mapping to store its data in, whose size matches the length specified in the second argument. The file descriptor can be obtained by calling the system function open. You can see that MMAP is really a fast, low-level function: it requires only a file descriptor instead of higher-level library structures such as FILE obtained via fopen. The file descriptor received from open gives MMAP access to the file without any other orchestration. The last argument, offset, we simply set to zero, since we want to map the entire file, not just parts of it (if you need to work with only parts of the file, it is better to do so with your own structures inside one mapping, since offset is not flexible: it must always be a multiple of the page size).

The return value of MMAP is a pointer to the mapped memory, so the behavior is similar to malloc (however, this memory cannot be freed with free; it is released with munmap instead). The return value can also be MAP_FAILED if the mapping fails for some reason. Here is a minimal sample program in C using MMAP (error checks are mostly omitted and the exact timestamp format is just for illustration):
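/* Minimal sketch: error handling is mostly omitted, message format is illustrative. */
#define _GNU_SOURCE           /* for fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* File size and mapping length: 1024 memory pages (usually 4 MiB). */
    long page_size = sysconf(_SC_PAGESIZE);
    size_t length = (size_t)page_size * 1024;

    /* Open (or create) the backing file and extend it to the mapping length. */
    int fd = open("mymemoryfile.bin", O_RDWR | O_CREAT, 0644);
    fallocate(fd, 0, 0, (off_t)length);

    /* Map the whole file: readable, writable, changes mirrored back to the file. */
    char *mem = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Whatever survived from the previous run (nothing on the first run). */
    printf("Old memory content: %s\n", mem);

    /* Overwrite the mapped memory; MAP_SHARED persists the change in the file. */
    char timestr[32];
    time_t now = time(NULL);
    strftime(timestr, sizeof(timestr), "%Y-%m-%d %H:%M:%S", localtime(&now));
    snprintf(mem, (size_t)page_size, "modification time: %s", timestr);

    printf("New memory content: %s\n", mem);

    munmap(mem, length);
    close(fd);
    return 0;
}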

This sample code creates a file named mymemoryfile.bin (if it does not already exist) and sets its size to page_size * 1024 bytes using fallocate, a system call supported by modern file systems that lets you extend a file to the requested size without writing the actual bytes. There should be additional checks for every call to see whether opening the file, the allocation and so on succeeded, but for the purposes of this sample I decided to leave most of them out.

Afterwards, MMAP is called in the way described above, with the file and the MAP_SHARED flag. Then the content of the memory is printed to standard output as a string, modified (via snprintf) and printed out again. The most interesting part comes when you rerun the application. During the first run, mymemoryfile.bin does not exist yet, hence the output looks roughly as follows (timestamps are illustrative):
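Old memory content:
New memory content: modification time: 2024-02-06 10:15:30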

Once the memory is used to write the “modification time” message, the result shows up in the output. Now, when you rerun the application, thanks to the automatic mirroring of the memory to the file, you should see the previous message as the “old memory content”:
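Old memory content: modification time: 2024-02-06 10:15:30
New memory content: modification time: 2024-02-06 10:17:02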

And yet one more run:
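Old memory content: modification time: 2024-02-06 10:17:02
New memory content: modification time: 2024-02-06 10:18:45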

The old content of the current run is always the same as the new content of the previous run. When you delete the mymemoryfile.bin file and run the application again, there is no old content available:
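Old memory content:
New memory content: modification time: 2024-02-06 10:20:11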

Since we are storing only ASCII characters as bytes in the memory/file, you should be able to see its content simply by using the cat command, for example:
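$ cat mymemoryfile.bin
modification time: 2024-02-06 10:20:11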

That’s it! Just play with the code, try setting different MMAP flags (see: https://linuxhint.com/using_mmap_function_linux/) and measure the performance. In the following article, I would like to continue the topic and show more specifically how MMAP can be used in the cyber security area.
