
.. |nbsp| unicode:: U+00A0
.. |tab| unicode:: U+00A0 U+00A0 U+00A0 U+00A0

.. _sec-archive:

Data Management
***************

:term:`CAPS` stores its archives in the :term:`SDS` directory structure shown in
figure :num:`fig-archive`. SDS organizes the data in directories by year,
network, station and channel. This tree structure eases archiving: for example,
one complete year may be moved to external storage such as a tape library.

.. _fig-archive:

.. figure:: media/sds.png
   :width: 12cm

   SDS archive structure of a CAPS archive

The data are stored in the channel directories. One file is created per sensor
location for each day of the year. File names take the form
:file:`$net.$sta.$loc.$cha.$year.$yday.data` with

* **net**: network code, e.g. 'II'
* **sta**: station code, e.g. 'BFO'
* **loc**: sensor location code, e.g. '00'. Empty codes are supported
* **cha**: channel code, e.g. 'BHZ'
* **year**: calendar year, e.g. '2021'
* **yday**: day of the year starting with '000' on 1 January

.. note ::

   In contrast to CAPS archives, in SDS archives created with
   `slarchive <https://docs.gempa.de/seiscomp/current/apps/slarchive.html>`_
   the first day of the year, 1 January, is referred to by index '001'.
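
The naming scheme can be illustrated with a short Python sketch. The function
name ``caps_file_name`` is hypothetical, and the zero-padding of ``yday`` to
three digits is an assumption derived from the '000' example above.

```python
from datetime import date


def caps_file_name(net: str, sta: str, loc: str, cha: str, day: date) -> str:
    """Build the archive file name for one stream and one day."""
    # CAPS counts days from zero, so 1 January maps to '000'.
    # Assumption: yday is zero-padded to three digits.
    yday = day.timetuple().tm_yday - 1
    return f"{net}.{sta}.{loc}.{cha}.{day.year}.{yday:03d}.data"


print(caps_file_name("II", "BFO", "00", "BHZ", date(2021, 1, 1)))
# An empty location code simply leaves its field empty.
print(caps_file_name("II", "BFO", "", "BHZ", date(2021, 12, 31)))
```

For an slarchive-style SDS archive the subtraction of one day would be omitted.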


.. _sec-caps-archive-file-format:

File Format
===========

:term:`CAPS` uses the `RIFF
<http://de.wikipedia.org/wiki/Resource_Interchange_File_Format>`_ file format
for data storage. A RIFF file consists of ``chunks``. Each chunk starts with an
8-byte chunk header followed by data. The first 4 bytes denote the chunk type, the
next 4 bytes the length of the following data block. Currently the following
chunk types are supported:

* **SID** - stream ID header
* **HEAD** - data information header
* **DATA** - data block
* **BPT** - b-tree index page
* **META** - meta chunk of the entire file containing states and a checksum
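
The chunk layout described above can be walked with a few lines of Python. This
is a minimal sketch under stated assumptions: the length field is taken to be
little-endian as in standard RIFF, chunks are assumed to follow each other
without padding, and pre-allocated space is ignored. The function name
``iter_chunks`` is hypothetical.

```python
import io
import struct


def iter_chunks(stream):
    """Yield (chunk_type, payload) pairs from a binary stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream or truncated header
        # Assumption: little-endian length, as in standard RIFF.
        chunk_type, size = struct.unpack("<4si", header)
        payload = stream.read(size)
        # Three-letter types are assumed to be padded with NUL or space.
        yield chunk_type.rstrip(b"\0 ").decode("ascii"), payload


# Synthetic two-chunk stream, built here only for demonstration.
demo = io.BytesIO(
    b"DATA" + struct.pack("<i", 3) + b"abc"
    + b"DATA" + struct.pack("<i", 2) + b"xy"
)
for chunk_type, payload in iter_chunks(demo):
    print(chunk_type, len(payload))
```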

Figure :num:`fig-file-one-day` shows the possible structure of an archive
file consisting of the different chunk types.

.. _fig-file-one-day:

.. figure:: media/file_one_day.png
   :width: 18cm

   Possible structure of an archive file


SID Chunk
---------

A data file may start with a SID chunk which defines the stream ID of the
data that follows in DATA chunks. In the absence of a SID chunk, the stream ID
is retrieved from the file name.

===================== ========= =====================
content               type      bytes
===================== ========= =====================
id="SID"              char[4]   4
chunkSize             int32     4
networkCode + '\\0'    char*    len(networkCode) + 1
stationCode + '\\0'    char*    len(stationCode) + 1
locationCode + '\\0'   char*    len(locationCode) + 1
channelCode + '\\0'    char*    len(channelCode) + 1
===================== ========= =====================
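
As an illustration, the payload of a SID chunk (the bytes following the 8-byte
chunk header) could be decoded as follows. The function name ``parse_sid`` is
hypothetical and ASCII encoding of the codes is an assumption.

```python
def parse_sid(payload: bytes) -> tuple:
    """Split a SID chunk payload into (net, sta, loc, cha)."""
    # Each code is terminated by a NUL byte, so splitting on NUL
    # yields the four codes plus one trailing empty element.
    parts = payload.split(b"\0")
    if len(parts) != 5 or parts[4] != b"":
        raise ValueError("expected four NUL-terminated codes")
    # Assumption: the codes are plain ASCII.
    return tuple(p.decode("ascii") for p in parts[:4])


# Stream II.BFO..BHZ with an empty location code.
print(parse_sid(b"II\0BFO\0\0BHZ\0"))
```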


HEAD Chunk
----------

The HEAD chunk contains information about subsequent DATA chunks. It has a fixed
size of 15 bytes and is inserted under the following conditions:

* before the first data chunk (beginning of file)
* packet type changed
* unit of measurement changed

===================== ========= ========
content               type      bytes
===================== ========= ========
id="HEAD"             char[4]   4
chunkSize (=7)        int32     4
version               int16     2
packetType            char      1
unitOfMeasurement     char[4]   4
===================== ========= ========

The ``packetType`` entry refers to one of the supported types described in
section :ref:`sec-packet-types`.
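
The 7-byte payload of a HEAD chunk maps directly onto a ``struct`` format
string. The sketch below assumes a little-endian, unpadded layout and keeps
``packetType`` as a raw integer because the numeric codes are not listed here.
The names ``HEAD_PAYLOAD`` and ``parse_head`` are hypothetical.

```python
import struct

# version (int16), packetType (1 byte), unitOfMeasurement (char[4]).
# Assumption: little-endian layout without padding.
HEAD_PAYLOAD = struct.Struct("<hB4s")


def parse_head(payload: bytes) -> dict:
    """Decode the 7-byte payload of a HEAD chunk."""
    version, packet_type, unit = HEAD_PAYLOAD.unpack(payload)
    return {
        "version": version,
        "packetType": packet_type,
        # Assumption: short units are padded with NUL or space.
        "unitOfMeasurement": unit.rstrip(b"\0 ").decode("ascii"),
    }


# Hypothetical values, packed here only to exercise the parser.
demo = HEAD_PAYLOAD.pack(1, 0, b"m/s\0")
print(parse_head(demo))
```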

DATA Chunk
----------

The DATA chunk contains the actual payload, which may be further structured
into header and data parts.

===================== ========= =========
content               type      bytes
===================== ========= =========
id="DATA"             char[4]   4
chunkSize             int32     4
data                  char*     chunkSize
===================== ========= =========

Section :ref:`sec-packet-types` describes the currently supported packet types.
Each packet type defines its own data structure. Nevertheless, :term:`CAPS`
requires each type to supply ``startTime`` and ``endTime`` information for
each record in order to create seamless data streams. The ``endTime`` may be
stored explicitly or may be derived from ``startTime``, ``chunkSize``,
``dataType`` and ``samplingFrequency``.

In addition to data streams, :term:`CAPS` also supports storing individual
measurements. These measurements are indicated by setting the sampling frequency
to 1/0, that is, numerator 1 and denominator 0.
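
The derivation of ``endTime`` can be sketched as plain arithmetic. The function
below is hypothetical. It assumes that ``endTime`` is the time of the first
sample after the record, and it treats a record with sampling frequency 1/0 as
a point in time, which is an assumption of this sketch.

```python
from datetime import datetime, timedelta
from fractions import Fraction


def derive_end_time(start, payload_bytes, sample_size, freq_num, freq_den):
    """Return the end time of a record holding evenly sampled data."""
    if freq_den == 0:
        # Individual measurement (frequency 1/0): no duration assumed.
        return start
    n_samples = payload_bytes // sample_size
    # Duration in seconds = samples / (freq_num / freq_den).
    seconds = Fraction(n_samples * freq_den, freq_num)
    return start + timedelta(microseconds=round(seconds * 1_000_000))


# 400 bytes of 4-byte samples at 100 Hz cover exactly one second.
print(derive_end_time(datetime(2021, 1, 1), 400, 4, 100, 1))
```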

BPT Chunk
---------

BPT chunks hold information about the file index. All data records are indexed
using a B+ tree. The index key is the tuple of start time and end time of each
data chunk to allow very fast time window lookup and to minimize disk accesses.
The value is a structure and holds the following information:

* File position of the format header
* File position of the record data
* Timestamp of record reception

This chunk holds a single index tree page with a fixed size of 4 KiB
(4096 bytes). More information about B+ trees can be found at
https://en.wikipedia.org/wiki/B%2B_tree.
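
The idea of a time window lookup on ``(startTime, endTime)`` keys can be
illustrated with a sorted list and binary search. This sketch deliberately
replaces the B+ tree by a flat list and does not model index pages. The
function name ``find_records`` and the use of plain numbers as time stamps are
assumptions.

```python
from bisect import bisect_left


def find_records(index, win_start, win_end):
    """Return the values of all records overlapping [win_start, win_end)."""
    # index is a list of ((start, end), value) pairs sorted by key.
    keys = [key for key, _ in index]
    # Records starting at or after win_end cannot overlap the window.
    stop = bisect_left(keys, (win_end,))
    return [value for (start, end), value in index[:stop] if end > win_start]


# Hypothetical index: the values stand in for file positions.
index = [((0, 10), "a"), ((10, 20), "b"), ((20, 30), "c")]
print(find_records(index, 5, 20))
```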

META Chunk
----------

Each data file contains a META chunk which holds information about the state of
the file. The META chunk is always located at a fixed position at the end of the
file. To minimize disk fragmentation, CAPS pre-allocates file space itself
without relying on native file system support. The META chunk therefore records
information such as:

* effectively used bytes in the file (virtual file size)
* position of the index root node
* the number of records in the file
* the covered time span

and some other internal information.


.. _sec-optimization:

Optimization
============

After a plugin packet is received and before it is written to disk,
:term:`CAPS` tries to optimize the file data in order to reduce the overall data
size and to decrease the access time. This includes:

* **merging** data chunks for continuous data blocks
* **splitting** data chunks on the date limit
* **trimming** overlapped data


Merging of Data Chunks
----------------------

:term:`CAPS` tries to create large continuous blocks of data by reducing the
number of data chunks. The advantage of large chunks is that less disk space is
occupied by data chunk headers. Seeking to a particular time stamp is also
faster because fewer data chunk headers need to be read.

Data chunks can be merged if the following conditions apply:

* merging is supported by the packet type
* the previous data header is compatible according to the packet specification,
  e.g. ``samplingFrequency`` and ``dataType`` match
* ``endTime`` of the last record equals ``startTime`` of the new record (no gap)
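
The three conditions can be expressed as a small predicate. The ``Record``
class below is a hypothetical in-memory stand-in for a data chunk, with times
given as integer microseconds to allow exact comparison. It is a sketch of the
rule, not the actual CAPS implementation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical in-memory record; field names follow the text.
    data_type: int
    freq_num: int
    freq_den: int
    start_time: int  # microseconds since an arbitrary epoch
    end_time: int
    samples: list


def can_merge(last: Record, new: Record, type_supports_merge: bool) -> bool:
    """Apply the three merge conditions to a pair of records."""
    return (
        type_supports_merge
        and last.data_type == new.data_type
        and (last.freq_num, last.freq_den) == (new.freq_num, new.freq_den)
        and last.end_time == new.start_time  # no gap, no overlap
    )


def merge(last: Record, new: Record) -> Record:
    """Append the new samples to the existing record."""
    return Record(last.data_type, last.freq_num, last.freq_den,
                  last.start_time, new.end_time, last.samples + new.samples)


a = Record(101, 2, 1, 0, 1_000_000, [1, 2])
b = Record(101, 2, 1, 1_000_000, 2_000_000, [3, 4])
if can_merge(a, b, type_supports_merge=True):
    print(merge(a, b).samples)
```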

Figure :num:`fig-file-merge` shows the arrival of a new plugin packet. In
alternative A) the merge fails and a new data chunk is created. In alternative B)
the merge succeeds: the new data is appended to the existing data block and the
original chunk header is updated to reflect the new chunk size.

.. _fig-file-merge:

.. figure:: media/file_merge.png
   :width: 18cm

   Merging of data chunks for seamless streams


Splitting of Data Chunks
------------------------

Figure :num:`fig-file-split` shows the arrival of a plugin packet containing
data of two different days. If possible, the data is split on the date limit. The
first part is appended to the existing data file. For the second part a new day
file is created, containing a new header and data chunk. This approach ensures
that a sample is stored in the correct data file and thus reduces the access
time.

Splitting of data chunks is only supported for packet types providing the
``trim`` operation.
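
Splitting on the date limit amounts to counting the samples that lie before
midnight. The following sketch is hypothetical: it assumes evenly sampled data
with the frequency given in Hz and handles only a single day boundary.

```python
import math
from datetime import datetime, timedelta


def split_at_midnight(start, samples, freq_hz):
    """Split samples into (first_day, second_day) parts at the next midnight."""
    midnight = datetime(start.year, start.month, start.day) + timedelta(days=1)
    # Sample i is taken at start + i / freq_hz, so the samples before
    # midnight are those with i < seconds_left * freq_hz.
    seconds_left = (midnight - start).total_seconds()
    n_first = min(len(samples), math.ceil(seconds_left * freq_hz))
    second_start = start + timedelta(seconds=n_first / freq_hz)
    return (start, samples[:n_first]), (second_start, samples[n_first:])


first, second = split_at_midnight(
    datetime(2021, 1, 1, 23, 59, 58), [10, 11, 12, 13, 14], 1
)
print(first)
print(second)
```

If the packet does not cross midnight, the second part is returned empty.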

.. _fig-file-split:

.. figure:: media/file_split.png
   :width: 18cm

   Splitting of data chunks on the date limit


Trimming of Overlaps
--------------------

The received plugin packets may contain overlapping time spans. If supported by
the packet type, :term:`CAPS` trims the data to create seamless data streams.
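
A possible trim operation is sketched below. It assumes that the leading part
of the newly received packet is discarded when it overlaps already archived
data; whether CAPS trims the new or the existing data is not stated here. Times
are plain seconds and the function name ``trim_overlap`` is hypothetical.

```python
import math


def trim_overlap(archived_end, new_start, new_samples, freq_hz):
    """Drop leading samples of a new packet that are already archived."""
    if new_start >= archived_end:
        return new_start, new_samples  # nothing overlaps
    # Samples taken before archived_end are dropped.
    n_drop = min(len(new_samples),
                 math.ceil((archived_end - new_start) * freq_hz))
    return new_start + n_drop / freq_hz, new_samples[n_drop:]


# The archive ends at t=10 s, the new packet starts at t=8 s with 1 Hz data.
print(trim_overlap(10.0, 8.0, [0, 1, 2, 3, 4], 1))
```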


.. _sec-packet-types:

Packet Types
============

:term:`CAPS` currently supports the following packet types:

* **RAW** - generic time series data
* **ANY** - any possible content
* **MiniSeed** - native :term:`MiniSeed`


.. _sec-pt-raw:

RAW
---

The RAW format is a lightweight format for uncompressed time series data with a
minimal header. The chunk header is followed by a 16-byte data header:

============================ ========= =========
content                      type      bytes
============================ ========= =========
dataType                     char      1
*startTime*                  TimeStamp [11]
|tab| year                   int16     2
|tab| yDay                   uint16    2
|tab| hour                   uint8     1
|tab| minute                 uint8     1
|tab| second                 uint8     1
|tab| usec                   int32     4
samplingFrequencyNumerator   uint16    2
samplingFrequencyDenominator uint16    2
============================ ========= =========

The number of samples is calculated by dividing the remaining ``chunkSize`` by
the size of the ``dataType``. The following data type values are supported:

==== ====== =====
id   type   bytes
==== ====== =====
  1  double 8
  2  float  4
100  int64  8
101  int32  4
102  int16  2
103  int8   1
==== ====== =====

The RAW format supports the ``trim`` and ``merge`` operations.
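
The header table and the data type table translate into the following decoding
sketch for the payload of a RAW DATA chunk. Little-endian byte order and an
unpadded layout are assumptions, and the names ``RAW_HEADER``,
``SAMPLE_FORMATS`` and ``parse_raw`` are hypothetical.

```python
import struct

# dataType, year, yDay, hour, minute, second, usec, numerator, denominator.
# Assumption: little-endian layout without padding (16 bytes in total).
RAW_HEADER = struct.Struct("<BhHBBBiHH")

# Data type id -> struct format character of one sample.
SAMPLE_FORMATS = {1: "d", 2: "f", 100: "q", 101: "i", 102: "h", 103: "b"}


def parse_raw(payload: bytes):
    """Decode a RAW DATA chunk payload into (header, samples)."""
    (data_type, year, yday, hour, minute, second, usec,
     freq_num, freq_den) = RAW_HEADER.unpack_from(payload)
    code = SAMPLE_FORMATS[data_type]
    # Number of samples = remaining bytes / size of one sample.
    n_samples = (len(payload) - RAW_HEADER.size) // struct.calcsize("<" + code)
    samples = struct.unpack_from(f"<{n_samples}{code}", payload, RAW_HEADER.size)
    header = {
        "dataType": data_type,
        "startTime": (year, yday, hour, minute, second, usec),
        "samplingFrequency": (freq_num, freq_den),
    }
    return header, list(samples)


# Synthetic payload: three int32 samples at 100 Hz.
demo = RAW_HEADER.pack(101, 2021, 0, 12, 30, 0, 0, 100, 1) + struct.pack("<3i", 1, -2, 3)
print(parse_raw(demo))
```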


.. _sec-pt-any:

ANY
---

The ANY format was developed to store any possible content in :term:`CAPS`. The chunk
header is followed by a 31-byte data header:

============================ ========= =========
content                      type      bytes
============================ ========= =========
type                         char[4]   4
dataType (=103, unused)      char      1
*startTime*                  TimeStamp [11]
|tab| year                   int16     2
|tab| yDay                   uint16    2
|tab| hour                   uint8     1
|tab| minute                 uint8     1
|tab| second                 uint8     1
|tab| usec                   int32     4
samplingFrequencyNumerator   uint16    2
samplingFrequencyDenominator uint16    2
endTime                      TimeStamp 11
============================ ========= =========

The ANY data header extends the RAW data header by a 4-character ``type``
field and an explicit ``endTime``. The ``type`` field is intended to give a hint
about the stored data. For example, an image from a webcam could be announced by
the string ``JPEG``.

Since the ANY format removes the restriction to a particular data type, the
``endTime`` can no longer be derived from the ``startTime`` and
``samplingFrequency``. Consequently, the ``endTime`` is explicitly specified in
the header.

Because the content of the ANY format is unspecified, it supports neither the
``trim`` nor the ``merge`` operation.
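
The 31-byte ANY header can be decoded in the same manner as the RAW header.
Again, little-endian byte order and an unpadded layout are assumptions, and the
names ``ANY_HEADER`` and ``parse_any`` are hypothetical. The content is
returned as opaque bytes.

```python
import struct

# type, dataType, startTime (6 fields), numerator, denominator,
# endTime (6 fields). Assumption: little-endian, no padding.
ANY_HEADER = struct.Struct("<4sBhHBBBiHHhHBBBi")


def parse_any(payload: bytes) -> dict:
    """Decode an ANY DATA chunk payload into header fields and content."""
    fields = ANY_HEADER.unpack_from(payload)
    return {
        "type": fields[0].rstrip(b"\0 ").decode("ascii"),
        # Time stamps as (year, yDay, hour, minute, second, usec).
        "startTime": fields[2:8],
        "samplingFrequency": fields[8:10],
        "endTime": fields[10:16],
        "content": payload[ANY_HEADER.size:],
    }


# Synthetic payload: a JPEG hint followed by two content bytes.
demo = ANY_HEADER.pack(b"JPEG", 103,
                       2021, 0, 0, 0, 0, 0,   # startTime
                       1, 0,                  # sampling frequency 1/0
                       2021, 0, 0, 0, 0, 0)   # endTime
demo += b"\xff\xd8"
print(parse_any(demo))
```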

.. _sec-pt-miniseed:

MiniSeed
--------

`MiniSeed <http://www.iris.edu/data/miniseed.htm>`_ is the standard for the
exchange of seismic time series. It uses a fixed record length and applies data
compression.

:term:`CAPS` adds no additional header to the :term:`MiniSeed` data. The
:term:`MiniSeed` record is directly stored after the 8-byte data chunk header.
All meta information needed by :term:`CAPS` is extracted from the
:term:`MiniSeed` header. The advantage of this native :term:`MiniSeed` support
is that existing plugin and client code may be reused. Also the transfer and
storage volume is minimized.

Because of the fixed record size requirement, neither the ``trim`` nor the
``merge`` operation is supported.

.. TODO:

   \subsection{Archive Tools}

   \begin{itemize}
    \item {\tt\textbf{riffsniff}} --
    \item {\tt\textbf{rifftest}} --
   \end{itemize}