How to read binary files in Python using NumPy?

I know how to read binary files in Python using NumPy's np.fromfile() function. The issue I'm faced with is that when I do so, the array has exceedingly large numbers of the order of 10^100 or so, with random nan and inf values.

I need to apply machine learning algorithms to this dataset and I cannot work with this data. I cannot normalise the dataset because of the nan values.

I've tried np.nan_to_num(), but that doesn't seem to help. Afterwards, my min and max values are about 3e-38 and 3e+38 respectively, so I still could not normalize the data.

Is there any way to scale this data down? If not, how should I deal with this?

Thank you.

EDIT:

Some context: I'm working on a malware classification problem. My dataset consists of live malware binaries, which are files of types such as .exe and .apk. My idea is to store these binaries as NumPy arrays, convert them to grayscale images and then perform pattern analysis on them.

2 Answers

If you want to make an image out of a binary file, you need to read it in as integer, not float. Currently, the most common format for images is unsigned 8-bit integers.
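The huge values, nan and inf described in the question are what you would expect when arbitrary bytes are reinterpreted as floating-point numbers: np.fromfile() uses dtype=float when no dtype is given. Here is a minimal sketch of the difference, using a throwaway file of random bytes (the file name sample.bin and its contents are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a real binary: 8,000 random bytes.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=8000, dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
raw.tofile(path)

# Default dtype is float64: every 8 bytes are reinterpreted as one float,
# so arbitrary bytes can turn into huge, tiny, nan or inf values.
as_float = np.fromfile(path)

# dtype=uint8: one array element per byte, always in the range 0..255.
as_bytes = np.fromfile(path, dtype=np.uint8)

print(as_float.shape, as_float.dtype)
print(as_bytes.shape, as_bytes.dtype, as_bytes.min(), as_bytes.max())
```

With uint8 there is nothing to clean up afterwards: the values are already bounded, so dividing by 255.0 is enough if a 0 to 1 range is needed.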

As an example, let's make an image out of the first 10,000 bytes of /bin/bash:

>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))

In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.

This is what the first 10,000 bytes of bash "looks" like:

EDIT 2

Refer to this answer, which states:
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems.

See also the question "Numpy integer nan", whose accepted answer states:
NaN can't be stored in an integer array. A nan is a special value for float arrays only. There are talks about introducing a special bit that would allow non-float arrays to store what in practice would correspond to a nan, but so far (2012/10), it's only talks. In the meantime, you may want to consider the numpy.ma package: instead of picking an invalid integer like -99999, you could use the special numpy.ma.masked value to represent an invalid value.

>>> a = np.ma.array([1,2,3,4,5], dtype=int)
>>> a[1] = np.ma.masked
>>> a
masked_array(data = [1 -- 3 4 5], mask = [False True False False False], fill_value = 999999)
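As a sketch of how masked arrays behave in practice: reductions such as mean() skip masked entries, and np.ma.masked_invalid() builds the mask from the nan and inf entries of a float array. The sample values below are made up for illustration.

```python
import numpy as np

# Integer array with one entry marked invalid instead of a sentinel like -99999.
a = np.ma.array([1, 2, 3, 4, 5], dtype=int)
a[1] = np.ma.masked

print(a.count())  # number of unmasked entries
print(a.mean())   # mean over the unmasked entries 1, 3, 4, 5

# For float data, mask every nan and inf in one step.
b = np.ma.masked_invalid(np.array([0.5, np.nan, 2.0, np.inf]))
print(b.min(), b.max())  # computed over the finite entries only
```

Note that masking only hides bad values. If they come from reading raw bytes as floats, reading the file with an integer dtype avoids them in the first place.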

EDIT 1

To read a binary file, read its content like this:
```
with open(fileName, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
```
After that, you can "unpack" the binary data using struct.unpack.
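As a sketch of what that unpacking might look like, assume a made-up 8-byte header consisting of a 4-byte tag followed by a little-endian unsigned 32-bit count. This layout is hypothetical; a real file format needs its own format string.

```python
import struct

# Hypothetical header layout: 4-byte tag, then a little-endian uint32.
HEADER_FORMAT = "<4sI"

# Build sample bytes in memory; in practice this would be the bytes
# returned by file.read() on a file opened with mode='rb'.
file_content = struct.pack(HEADER_FORMAT, b"DEMO", 1234) + b"\x00\x01\x02"

# calcsize() reports how many bytes the format string consumes.
header_size = struct.calcsize(HEADER_FORMAT)
tag, count = struct.unpack(HEADER_FORMAT, file_content[:header_size])

print(tag, count)
```

struct.unpack requires the buffer length to match the format exactly, which is why the header is sliced out before unpacking.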
If you are using the np.fromfile() function:
numpy.fromfile can read data from both text and binary files. You would first construct a data type that represents your file format using numpy.dtype, and then read this type from the file using numpy.fromfile.
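A minimal sketch of that approach, assuming a made-up record layout of a little-endian uint16 id followed by a float32 value (the field names and the file records.bin are hypothetical):

```python
import os
import tempfile

import numpy as np

# Hypothetical record format: little-endian uint16 id, then float32 value.
record = np.dtype([("id", "<u2"), ("value", "<f4")])

# Write three sample records so the example is self-contained.
sample = np.array([(1, 0.5), (2, 1.5), (3, 2.5)], dtype=record)
path = os.path.join(tempfile.mkdtemp(), "records.bin")
sample.tofile(path)

# Read the file back with the same dtype; each element is one record.
data = np.fromfile(path, dtype=record)

print(data["id"])
print(data["value"])
```

For an opaque binary such as an executable there is no record structure to describe, so a plain dtype such as uint8 is the natural choice.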
