Encoding ODB-2 Data
Trivial Example
Given a pandas DataFrame, encode it by passing it to the encode_odb() function:
[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc
df = pd.read_csv('data-1.csv')
odc.encode_odb(df, 'example-1.odb')
File-like Objects
Encoding of ODB-2 data works with file-like objects as well as with file names:
[3]:
with open('example-1.odb', 'wb') as f:
    odc.encode_odb(df, f)
Configuring Encoded Columns
By default, pyodc encodes ODB-2 data losslessly. In particular, most values are encoded as 8-byte DOUBLE values.
The encoder normally selects a data type and a corresponding encoder automatically. This choice can be overridden per column by supplying a types dictionary that maps column names to data types, for example to encode a column as a 4-byte REAL value:
[4]:
odc.encode_odb(df, 'example-3.odb', types={'obsvalue@body': odc.REAL})
Inspecting the frame headers shows that the data type has changed:
[5]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r3 = odc.Reader('example-3.odb', aggregated=False)
print('original:', r1.frames[0].column_dict['obsvalue@body'].dtype)
print('updated: ', r3.frames[0].column_dict['obsvalue@body'].dtype)
original: DataType.DOUBLE
updated: DataType.REAL
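The decoded data also confirms that the precision has been reduced. The obsvalue@body values now show single-precision rounding, for example 49.382401 where the input held 49.3824: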
[6]:
df_decoded = odc.read_odb('example-3.odb', single=True)
print(df_decoded)
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.000000
1 1 20210420 stat01 0-12345-0-67891 12.345600
2 1 20210420 stat02 0-12345-0-67892 24.691200
3 1 20210420 stat03 0-12345-0-67893 37.036800
4 1 20210420 stat04 0-12345-0-67894 49.382401
5 1 20210420 stat05 0-12345-0-67895 61.728001
6 1 20210420 stat06 0-12345-0-67896 74.073601
7 1 20210420 stat07 0-12345-0-67897 86.419197
8 1 20210420 stat08 0-12345-0-67898 98.764801
9 1 20210420 stat09 0-12345-0-67899 111.110397
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0.0
1 4321.0 43.21 9 9.0
2 NaN NaN 6 6.0
3 1234.0 12.34 10 10.0
4 4321.0 43.21 5 5.0
5 NaN NaN 7 NaN
6 1234.0 12.34 15 15.0
7 4321.0 43.21 0 0.0
8 NaN NaN 9 9.0
9 1234.0 12.34 6 6.0
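The trailing digits in obsvalue@body above (for example 49.382401 and 111.110397) are what one would expect from storing each value in 4 bytes. The following is a minimal sketch using only the Python standard library, independent of pyodc. It assumes that REAL corresponds to an IEEE 754 single-precision float, and the helper name to_float32 is introduced here purely for illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python float (8-byte double) through a 4-byte IEEE 754 float."""
    # Assumption: this mimics the precision loss of a 4-byte REAL column.
    return struct.unpack('<f', struct.pack('<f', value))[0]


# A few of the obsvalue@body inputs from the example data.
for original in (12.3456, 49.3824, 111.1104):
    reduced = to_float32(original)
    print(f"{original} -> {reduced:.6f} (error {abs(reduced - original):.2e})")
```

Formatted to six decimal places, the round-tripped values agree with the decoded rows shown above, which suggests the differences come from the reduced storage width rather than from any change to the input data.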
Configuring Frame Structure
ODB-2 data is broken down into frames. By default, at most 10 000 rows of data are encoded into each frame. If more than 10 000 rows are supplied, the data is split into a sequence of frames, each containing at most 10 000 rows.
To modify the threshold, pass the rows_per_frame argument:
[7]:
odc.encode_odb(df, 'example-4.odb', rows_per_frame=3)
Examination of the frame structure clearly shows that the data now contains multiple frames:
[8]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r4 = odc.Reader('example-4.odb', aggregated=False)
print('original frames:', r1.frames)
print('updated frames:', r4.frames)
print('original row counts:', [f.nrows for f in r1.frames])
print('updated row counts:', [f.nrows for f in r4.frames])
original frames: [<pyodc.frame.Frame object at 0x1218d4b80>]
updated frames: [<pyodc.frame.Frame object at 0x1080e3c70>, <pyodc.frame.Frame object at 0x121867610>, <pyodc.frame.Frame object at 0x121867d30>, <pyodc.frame.Frame object at 0x121866e60>]
original row counts: [10]
updated row counts: [3, 3, 3, 1]
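The row counts above follow from filling each frame up to the threshold and placing the remainder in a final, shorter frame. The sketch below reproduces that arithmetic in plain Python. The function frame_row_counts is a hypothetical helper written for this illustration and is not part of pyodc; it only models the splitting behaviour described in this section.

```python
from typing import List


def frame_row_counts(total_rows: int, rows_per_frame: int = 10000) -> List[int]:
    """Return the number of rows each frame would hold for a given threshold."""
    if rows_per_frame <= 0:
        raise ValueError("rows_per_frame must be positive")
    counts = []
    remaining = total_rows
    while remaining > 0:
        # Each frame takes as many rows as allowed, or whatever is left.
        rows = min(rows_per_frame, remaining)
        counts.append(rows)
        remaining -= rows
    return counts


print(frame_row_counts(10))     # default threshold: a single frame
print(frame_row_counts(10, 3))  # threshold of 3: three full frames and a remainder
```

For the ten-row example data this gives one frame with the default threshold and four frames with rows_per_frame=3, matching the counts reported by the Reader.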
Despite these differences, the decoded data is the same:
[9]:
df_decoded = odc.read_odb('example-4.odb', single=True)
print(df_decoded)
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
3 1 20210420 stat03 0-12345-0-67893 37.0368
4 1 20210420 stat04 0-12345-0-67894 49.3824
5 1 20210420 stat05 0-12345-0-67895 61.7280
6 1 20210420 stat06 0-12345-0-67896 74.0736
7 1 20210420 stat07 0-12345-0-67897 86.4192
8 1 20210420 stat08 0-12345-0-67898 98.7648
9 1 20210420 stat09 0-12345-0-67899 111.1104
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0.0
1 4321.0 43.21 9 9.0
2 NaN NaN 6 6.0
3 1234.0 12.34 10 10.0
4 4321.0 43.21 5 5.0
5 NaN NaN 7 NaN
6 1234.0 12.34 15 15.0
7 4321.0 43.21 0 0.0
8 NaN NaN 9 9.0
9 1234.0 12.34 6 6.0
Additional Properties
To encode additional properties as part of a frame's data, pass a dictionary of the values you want to include as the properties parameter of the encode_odb() function:
[10]:
metadata = {
    'encoded_by': 'ECMWF',
    'data_source': 'pyodc_docs',
}
odc.encode_odb(df, 'example-5.odb', properties=metadata)
Encoded properties are accessible via the properties attribute of the frame object:
[11]:
r1 = odc.Reader('example-5.odb')
print([f.properties for f in r1.frames])
[{'encoded_by': 'ECMWF', 'data_source': 'pyodc_docs'}]
Encoding Bitfields
Bitfield columns encode integer values accompanied by metadata describing the meaning of the bits. A column cannot be auto-detected as a bitfield, because its data is indistinguishable from ordinary integers, so the type must be set explicitly.
An additional dictionary may be passed to the encode function to describe the bitfield structure. For each bitfield column, a sequence of values should be supplied, one per bit field. Each of these values can take one of two forms:
A string, naming the bit field (which will be assumed to comprise a single bit)
A tuple of the name of the bit field and the number of corresponding bits
The sequence supplied should match the number of bits set in the values.
[12]:
types = {
    'bf_column': odc.BITFIELD,
    'bf_missing': odc.BITFIELD,
}
bitfields = {
    'bf_column': ['bit1', ('bitpair', 2), ('bit4', 1)],
    'bf_missing': ['bit1', ('bitpair', 2), ('bit4', 1)]
}
odc.encode_odb(df, 'example-6.odb', types=types, bitfields=bitfields)
The bitfield structure can be seen by explicitly decoding the individual bit fields:
[13]:
df_decoded = odc.read_odb('example-6.odb',
columns=['bf_column.bit1', 'bf_column.bitpair', 'bf_column.bit4',
'bf_missing.bit1', 'bf_missing.bitpair', 'bf_missing.bit4'],
single=True)
print(df_decoded)
bf_column.bit1 bf_column.bitpair bf_column.bit4 bf_missing.bit1 \
0 False 0 False False
1 True 0 True True
2 False 3 False False
3 False 1 True False
4 True 2 False True
5 True 3 False None
6 True 3 True True
7 False 0 False False
8 True 0 True True
9 False 3 False False
bf_missing.bitpair bf_missing.bit4
0 0.0 False
1 0.0 True
2 3.0 False
3 1.0 True
4 2.0 False
5 NaN None
6 3.0 True
7 0.0 False
8 0.0 True
9 3.0 False
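The decoded columns above can be reproduced by hand from the integer values in bf_column. The sketch below unpacks an integer according to a layout given in the same form as the bitfields dictionary. It assumes that the first named field occupies the least significant bit; that ordering is inferred from the output shown here (for example 9 decodes to bit1=True, bitpair=0, bit4=True) rather than taken from the pyodc documentation. The function decode_bitfield is a hypothetical helper and is not part of pyodc.

```python
def decode_bitfield(value, layout):
    """Split an integer into named bit fields.

    layout is a sequence of names (1 bit each) or (name, nbits) tuples.
    Assumption: the first entry starts at the least significant bit.
    """
    fields = {}
    offset = 0
    for entry in layout:
        name, nbits = (entry, 1) if isinstance(entry, str) else entry
        raw = (value >> offset) & ((1 << nbits) - 1)
        # Single-bit fields are reported as booleans, wider fields as integers.
        fields[name] = bool(raw) if nbits == 1 else raw
        offset += nbits
    return fields


layout = ['bit1', ('bitpair', 2), ('bit4', 1)]
for value in (0, 9, 6, 10, 5, 7, 15):
    print(value, decode_bitfield(value, layout))
```

Under that assumption, each printed dictionary matches the corresponding row of the bf_column output. The missing entry in bf_missing is not modelled here, since this sketch only handles integer inputs.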