How to store cyber security data using MMAP in C
In the previous article, I described why it is important to have fast access to memory that is also persisted automatically to the file system. Such memory lets a program work with more data than fits into RAM and, at the same time, avoids data loss when the program is restarted or killed. For big data that are accessed more or less randomly, this can be achieved with the system call MMAP. It needs to be given the proper flags and a descriptor of an already created large file whose size is a multiple of the memory page size. Before continuing, please refer to the previous article for further clarification.
The best thing about the return value of MMAP is that it is an ordinary pointer to the mapped memory, where you can store any type of object, such as structures. In the cyber security area, you usually want to have data organized into structures that are time and dimension specific. What do I mean by that? Well, when we are monitoring user behavior, we usually want to have the data organized into “chunks” by the time at which the user action took place. A “chunk” can be 10 seconds or 5 minutes long, depending on the type of analysis we want to perform. On top of that, each “chunk” is specific to a dimension, which in our case can be the user name or another dimension related to the identity of the user and their actions. These “chunks” are then usually organized into larger blocks for performance reasons. Let me explain in more detail.
The data that come from monitoring the activity of an entity such as a user are usually sequential in time. Imagine that at time zero there are a few documents describing the user behavior, then 5 seconds after time zero there are further user activities, and so on. There are usually gaps, but they are not big. The analysis then walks through the data from subsequent times and performs aggregations such as sum, mean, spike and so on. For the analysis to be really fast, the sequential data need to be located in memory one after another, ordered by time. If the analysis had to jump between memory blocks at completely different offsets, performance would be lost on the hopping alone. That is why “chunks” must be stored in larger segments, which can be seen as arrays of sequential “chunks” (the first “chunk” represents data from some time zero, the second “chunk” data from 5 seconds after zero, the third “chunk” data from 10 seconds after zero, etc.).
The following sample shows this idea in a very simple way. Each “chunk” is simply an array of characters with some maximum length. This array can be filled with IDs of user activities separated by commas. Then every segment, which is the structure actually being stored in the memory, is just an array of “chunks” with some metadata (start time of the segment in this case):
```c
#define USER_CHUNK_MAX_LENGTH 64
#define USER_CHUNKS_MAX_COUNT 30

struct segment {
    uint64_t time;
    char user_chunks[USER_CHUNKS_MAX_COUNT][USER_CHUNK_MAX_LENGTH];
};
```
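To make the layout more concrete, here is a minimal sketch of how an event time could be mapped to a “chunk” inside such a segment. The 5-second chunk length (CHUNK_DURATION_SECONDS) and the helper name chunk_index_for_time are assumptions made only for this illustration; they are not part of the sample.

```c
#include <stdint.h>

#define USER_CHUNKS_MAX_COUNT 30
#define CHUNK_DURATION_SECONDS 5 /* assumption: every chunk covers 5 seconds */

/* Returns the index of the chunk that covers event_time inside a segment
 * starting at segment_start, or -1 when the event lies outside the segment. */
int chunk_index_for_time(uint64_t segment_start, uint64_t event_time)
{
    uint64_t index;

    if (event_time < segment_start) {
        return -1; /* the event is older than the segment */
    }

    index = (event_time - segment_start) / CHUNK_DURATION_SECONDS;
    if (index >= USER_CHUNKS_MAX_COUNT) {
        return -1; /* the event belongs to a later segment */
    }

    return (int)index;
}
```

Because neighboring times map to neighboring indexes, an analysis that walks forward in time also walks forward through the memory of the segment.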
It is always better to give arrays a fixed size, if possible. Dynamic arrays have many security implications, especially when their size depends on user input. The same applies to strings, which is why I always use the snprintf function in the samples. Another important thing is to use numeric types with a fixed size (uint64_t instead of long int) to avoid compatibility issues.
Now, to put it all together, let us take the sample from the previous article and extend it with the idea of segments being stored in MMAP. The file that the memory is mirrored to should be big enough to store 3 segments (i.e. at least sizeof(struct segment) * 3 bytes). Look at the following modified sample:
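As an illustration of why snprintf fits fixed-size “chunks” well, the following sketch appends one activity ID to a chunk and refuses the write when the ID would not fit. The helper name append_activity_id and the choice to treat IDs as uint64_t values are assumptions for this example only.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define USER_CHUNK_MAX_LENGTH 64

/* Appends id to the comma-separated list stored in chunk.
 * Returns 0 on success, -1 when the chunk is full or not NUL-terminated. */
int append_activity_id(char chunk[USER_CHUNK_MAX_LENGTH], uint64_t id)
{
    const char *end = memchr(chunk, '\0', USER_CHUNK_MAX_LENGTH);
    const char *separator;
    size_t used;
    size_t free_space;
    int written;

    if (end == NULL) {
        return -1; /* no terminator inside the chunk, do not touch it */
    }

    used = (size_t)(end - chunk);
    free_space = USER_CHUNK_MAX_LENGTH - used;
    separator = (used == 0) ? "" : ",";

    /* snprintf never writes more than free_space bytes, terminator included */
    written = snprintf(chunk + used, free_space, "%s%" PRIu64, separator, id);
    if (written < 0 || (size_t)written >= free_space) {
        chunk[used] = '\0'; /* drop the truncated ID again */
        return -1;
    }

    return 0;
}
```

The return value of snprintf is the length the text would have had without the limit, so comparing it with the free space is enough to detect truncation.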
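The size requirement can be sketched as a small helper that rounds the space needed for the segments up to a whole number of pages. The function names round_up_to_page and segments_file_size are hypothetical. Note that this rounding differs slightly from the expression used in the sample, which always adds one more page, even when the size is already an exact multiple of the page size.

```c
#include <stddef.h>
#include <stdint.h>

#define USER_CHUNK_MAX_LENGTH 64
#define USER_CHUNKS_MAX_COUNT 30

struct segment {
    uint64_t time;
    char user_chunks[USER_CHUNKS_MAX_COUNT][USER_CHUNK_MAX_LENGTH];
};

/* Rounds size up to the nearest multiple of page_size. */
size_t round_up_to_page(size_t size, size_t page_size)
{
    return (size + page_size - 1) / page_size * page_size;
}

/* Returns the file size needed to hold segment_count segments. */
size_t segments_file_size(size_t segment_count, size_t page_size)
{
    return round_up_to_page(sizeof(struct segment) * segment_count, page_size);
}
```

In a real program, the page size passed in would come from getpagesize() or sysconf(_SC_PAGESIZE).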
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

#define USER_CHUNK_MAX_LENGTH 64
#define USER_CHUNKS_MAX_COUNT 30
#define COUNT_OF_SEGMENTS_IN_FILE 3

struct segment {
    uint64_t time;
    char user_chunks[USER_CHUNKS_MAX_COUNT][USER_CHUNK_MAX_LENGTH];
};

int main(void) {
    struct segment * segments;
    int file_descriptor;
    size_t page_size;
    size_t file_size;

    // COUNT_OF_SEGMENTS_IN_FILE is 3, hence we are going to have space for three segments
    struct segment * segment1;
    struct segment * segment2;
    struct segment * segment3;

    // Open the file to be used for "swapping" inside mmap
    file_descriptor = open("./myfile.bin", O_RDWR | O_CREAT, (mode_t)0644);

    // Make sure the file has enough space for an array of segments with maximum COUNT_OF_SEGMENTS_IN_FILE
    // Size of the file must be based on page size of the system
    page_size = getpagesize();
    file_size = (sizeof(struct segment) * COUNT_OF_SEGMENTS_IN_FILE) / page_size * page_size + page_size;
    fallocate(file_descriptor, 0, 0, file_size);

    // Obtain the memory for the array of segments using MMAP
    // The length of the memory is the same as the size of the mapped file
    // MAP_SHARED - enables "swapping" between the actual RAM and file
    segments = (struct segment *)mmap(
        NULL,
        file_size,
        PROT_READ | PROT_WRITE,
        MAP_SHARED,
        file_descriptor,
        0
    );

    // Obtain individual segments from the memory
    segment1 = &segments[0];
    segment2 = &segments[1];
    segment3 = &segments[2];

    // Read the first segment's data
    // PRIu64 is the portable format specifier for uint64_t
    fprintf(
        stdout,
        "First segment starts at '%" PRIu64 "' and contains the following data: '%s', '%s', '%s' ....\n",
        segment1->time,
        (char *)segment1->user_chunks[0],
        (char *)segment1->user_chunks[1],
        (char *)segment1->user_chunks[2]
    );

    // Write data to the first segment
    segment1->time = (uint64_t)time(NULL);
    snprintf(segment1->user_chunks[0], USER_CHUNK_MAX_LENGTH, "%s", "1,2,3,4");
    snprintf(segment1->user_chunks[1], USER_CHUNK_MAX_LENGTH, "%s", "12,45,44,777,55,99");
    snprintf(segment1->user_chunks[2], USER_CHUNK_MAX_LENGTH, "%s", "5656,565,474,122,23,5,6,9,8,7");

    // Print the new first segment's time
    fprintf(stdout, "Now, the first segment starts at '%" PRIu64 "'.\n", segment1->time);

    // Make sure to rerun the application to see changes in the output
    // TODO: Modify data in other segments

    return 0;
}
```
Every call should be followed by a check of whether the opening of the file, the MMAP allocation etc. was successful, but for the purposes of the sample I decided to leave the checks out. The memory obtained from MMAP is cast to an array of segments (struct segment *). We are then able to take the address of each segment (segment2 = &segments[1];), print its data and modify them. The sample inserts the first three “chunks” into the chunk array using snprintf (snprintf(segment1->user_chunks[0] …). On the first run, the application prints the following output:
```
$ gcc mmapsample.c -o mmapsample
$ ./mmapsample
First segment starts at '0' and contains the following data: '', '', '' ....
Now, the first segment starts at '1650968627'.
```
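For completeness, the checks that the sample leaves out could look roughly like the following sketch. The helper name map_file is hypothetical, and the sketch swaps fallocate for posix_fallocate, its portable counterpart, which reports failures through its return value instead of errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens (or creates) path, makes sure it has file_size bytes and maps it.
 * Returns the mapped memory, or NULL after printing the reason to stderr. */
void *map_file(const char *path, size_t file_size)
{
    void *memory;
    int result;
    int file_descriptor = open(path, O_RDWR | O_CREAT, (mode_t)0644);

    if (file_descriptor == -1) {
        perror("open");
        return NULL;
    }

    /* posix_fallocate returns an error number directly */
    result = posix_fallocate(file_descriptor, 0, (off_t)file_size);
    if (result != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(result));
        close(file_descriptor);
        return NULL;
    }

    memory = mmap(NULL, file_size, PROT_READ | PROT_WRITE, MAP_SHARED, file_descriptor, 0);
    if (memory == MAP_FAILED) {
        perror("mmap");
        close(file_descriptor);
        return NULL;
    }

    /* The mapping stays valid after the descriptor is closed */
    close(file_descriptor);
    return memory;
}
```

A caller would cast the returned pointer to struct segment * exactly as the sample does, and stop with an error message when NULL comes back.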
During the first run, the segment was modified and the changes were automatically synchronized to the mapped file. So on the second run, you get the modified values printed to the console’s standard output. The data are the first three “chunks” with user activity IDs separated by commas that were inserted in the previous run:
```
$ ./mmapsample
First segment starts at '1650968627' and contains the following data: '1,2,3,4', '12,45,44,777,55,99', '5656,565,474,122,23,5,6,9,8,7' ....
Now, the first segment starts at '1650968671'.
```
Please feel free to modify the sample to insert more chunks or modify other segments. You can work with them as with normal structs, and since they are automatically mirrored to the file, you do not have to worry about their deallocation: the mapping is released when the process exits (a long-running program can release it earlier with munmap). In some of the following articles, I would like to focus on more possibilities that come with MMAP, for instance using MMAP in Python for data science tasks.