File: AAREADME.txt
Database: Natus AEEG Corpus (NAEG)
Version: 1.0.0
-------------------------------------------------------------------------------
Change Log:

 v1.0.0 (20250420): Initial release of the first 100 studies
-------------------------------------------------------------------------------

This file describes the Natus Ambulatory EEG (NAEG) Corpus and provides
some basic statistics about it. This release consists of 100 studies that
are nominally 72-hour continuous recordings.

When you use this specific corpus in your research or technology
development, we ask that you reference the corpus using this
publication:

 Melles, A.-M., Paderewski, M., Oymann, R., Shah, V., Salazar, J.,
 Obeid, I., & Picone, J. (2024). Annotation of Ambulatory EEGs. Proceedings
 of the IEEE Signal Processing in Medicine and Biology Symposium, 1–4.
 doi: 10.1109/SPMB62441.2024.10842264

This publication can be retrieved from:

https://isip.piconepress.com/publications/conference_presentations/2024/ieee_spmb/aeeg/

There are two main directories in this release:

 nedc_130_[1]: p
 /data/isip/data/natus_aeeg/v1.0.0
 nedc_130_[1]: d
 ...
 drwxrwxr-x   3 picone isip    5 Apr 19 16:57 DOCS/
 drwxrwxr-x 102 picone isip  102 Feb 21 16:23 edf/
 ...

The DOCS directory contains relevant documentation, including an annotator
log with comments about each study, the montages used to load the data into
our annotation tool, and a list of seizure types.

The EEG data is stored as edf files in the edf directory, which is
organized as follows:

 edf/
  d0142fa23def05ce051d3c56514d8fef/
   d0142fa23def05ce051d3c56514d8fef_00.edf
   d0142fa23def05ce051d3c56514d8fef_00/
    d0142fa23def05ce051d3c56514d8fef_00_000.csv
    d0142fa23def05ce051d3c56514d8fef_00_000.csv_bi
    d0142fa23def05ce051d3c56514d8fef_00_000.edf
    d0142fa23def05ce051d3c56514d8fef_00_001.csv
    d0142fa23def05ce051d3c56514d8fef_00_001.csv_bi
    d0142fa23def05ce051d3c56514d8fef_00_001.edf
    ...
    d0142fa23def05ce051d3c56514d8fef_00_011.csv
    d0142fa23def05ce051d3c56514d8fef_00_011.csv_bi
    d0142fa23def05ce051d3c56514d8fef_00_011.edf
   d0142fa23def05ce051d3c56514d8fef_01.edf
   d0142fa23def05ce051d3c56514d8fef_01/
   ...
   d0142fa23def05ce051d3c56514d8fef_11.edf
   d0142fa23def05ce051d3c56514d8fef_11/

The study identifier is d0142fa23def05ce051d3c56514d8fef. This study was
split into 12 edf files, each of which is 6 hours in duration. This is how
the data was delivered to us by Natus.

Each of these 6-hour files was split into nominally twelve 30-minute files
(e.g., *_00_000.edf, *_00_001.edf) and stored in a subdirectory named with
the same study identifier and sequence number (e.g., "_00"). This was done
mainly for annotator convenience: our interactive tools run much faster
when the signal is less than one hour in duration.
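
As a minimal sketch of how this naming convention can be unpacked, the
shell function below splits a segment file name into its study identifier,
6-hour sequence number, and 30-minute segment number. The function name
naeg_parse_name is hypothetical (it is not part of any released tool), and
the sketch assumes names of the form <study>_<SS>_<NNN>.<ext> as shown
above.

```shell
# Hypothetical helper: split <study>_<SS>_<NNN>.<ext> into its three parts.
naeg_parse_name() {
  base=${1##*/}        # drop any leading directories
  stem=${base%%.*}     # drop the extension (.edf, .csv, .csv_bi)
  segment=${stem##*_}  # last field: 30-minute segment number
  rest=${stem%_*}
  session=${rest##*_}  # next field: 6-hour sequence number
  study=${rest%_*}     # remainder: study identifier
  printf '%s %s %s\n' "$study" "$session" "$segment"
}

# Prints: d0142fa23def05ce051d3c56514d8fef 00 011
naeg_parse_name d0142fa23def05ce051d3c56514d8fef_00_011.csv_bi
```

Only POSIX parameter expansion is used, so the function should behave the
same in sh, bash, and similar shells.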

Within this subdirectory, there are three types of files:

 *.edf:    the EEG sampled data in European Data Format (edf)
 *.csv:    event-based annotations using all available seizure type classes
 *.csv_bi: term-based annotations using only two labels (bckg and seiz)
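
Since each 30-minute edf file is expected to have a matching *.csv and
*.csv_bi file, a quick consistency check is possible. The sketch below
lists any annotation file that is missing. The function name
naeg_check_triplets is hypothetical, and the sketch assumes the directory
layout and file naming described above.

```shell
# Hypothetical helper: report 30-minute segments that lack an annotation file.
# Usage (path is an example): naeg_check_triplets /path/to/edf
naeg_check_triplets() {
  find "$1" -name "*_??_???.edf" | while IFS= read -r f; do
    stem=${f%.edf}
    [ -f "$stem.csv" ]    || echo "missing: $stem.csv"
    [ -f "$stem.csv_bi" ] || echo "missing: $stem.csv_bi"
  done
}
```

No output means every segment found has both annotation files.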

Event-based annotations are per-channel. This means the annotation contains
a channel index in addition to a start and stop time. Seizures can often be
observed first on one or more channels before spreading to other channels.
Event-based annotations capture this.

Term-based annotations use one label that applies to all channels. These
are most useful for machine learning research in which we tend to worry
only about the overall classification of a segment and are not concerned
about individual channels.
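
To illustrate how the two annotation styles relate, the sketch below
collapses per-channel seizure events into channel-independent intervals by
merging events that overlap in time. This is only an illustration, not the
procedure used to produce the *.csv_bi files. It assumes each annotation
row has the comma-separated layout channel,start,stop,label,confidence and
that seizure labels end in "sz" (the pattern used by the grep commands
later in this file). The function name naeg_merge_events is hypothetical.

```shell
# Hypothetical helper: merge overlapping per-channel seizure events.
# Assumed row layout (an assumption, not confirmed for *.csv files):
#   channel,start,stop,label,confidence
# Usage: naeg_merge_events some_segment.csv
naeg_merge_events() {
  grep "sz," "$1" | sort -t, -k2,2n |
  awk -F, '
    NR == 1        { start = $2 + 0; stop = $3 + 0; next }
    # Event begins before the current interval ends: extend the interval.
    $2 + 0 <= stop { if ($3 + 0 > stop) stop = $3 + 0; next }
    # Otherwise emit the finished interval and start a new one.
                   { printf "%.4f %.4f\n", start, stop
                     start = $2 + 0; stop = $3 + 0 }
    END            { if (NR > 0) printf "%.4f %.4f\n", start, stop }'
}
```

Each output line is one merged interval (start and stop in seconds).
Events that merely touch (one stops exactly where the next starts) are
merged as well; change "<=" to "<" if that is not desired.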

Bi-class annotations use two labels: seizure (seiz) and background
(bckg).  The multi-class annotations use all available seizure
types. These are described in the spreadsheet:

 $NAEG/v1.0.0/DOCS/seizures_types_v02.xlsx

The channel arrangement for this data is consistent:

	  channel[   0]:      200.0 Hz (FP1)
	  channel[   1]:      200.0 Hz (F7)
	  channel[   2]:      200.0 Hz (T3)
	  channel[   3]:      200.0 Hz (A1)
	  channel[   4]:      200.0 Hz (T5)
	  channel[   5]:      200.0 Hz (O1)
	  channel[   6]:      200.0 Hz (F3)
	  channel[   7]:      200.0 Hz (C3)
	  channel[   8]:      200.0 Hz (P3)
	  channel[   9]:      200.0 Hz (FZ)
	  channel[  10]:      200.0 Hz (CZ)
	  channel[  11]:      200.0 Hz (PZ)
	  channel[  12]:      200.0 Hz (FP2)
	  channel[  13]:      200.0 Hz (F8)
	  channel[  14]:      200.0 Hz (T4)
	  channel[  15]:      200.0 Hz (A2)
	  channel[  16]:      200.0 Hz (T6)
	  channel[  17]:      200.0 Hz (O2)
	  channel[  18]:      200.0 Hz (F4)
	  channel[  19]:      200.0 Hz (C4)
	  channel[  20]:      200.0 Hz (P4)
	  channel[  21]:      200.0 Hz (X1)
	  channel[  22]:      200.0 Hz (X2)
	  channel[  23]:      200.0 Hz (DIF1)
	  channel[  24]:       57.0 Hz (EDF ANNOTATIONS)

To annotate this data, we used a tcp_ar montage:

 nedc_130_[1]: more DOCS/montages/01_tcp_ar_natus_montage.txt
 # file: $NATUS_AEEG/DOCS/01_tcp_ar_natus_montage.txt
 #
 # This file contains our first attempt at a tcp_ar montage.
 #
 [Montage]
 montage = 0, FP1-F7: FP1 -- F7
 montage = 1, F7-T3: F7 -- T3
 montage = 2, T3-T5: T3 -- T5
 montage = 3, T5-O1: T5 -- O1
 montage = 4, FP2-F8: FP2 -- F8
 montage = 5, F8-T4: F8 -- T4
 montage = 6, T4-T6: T4 -- T6
 montage = 7, T6-O2: T6 -- O2
 montage = 8, A1-T3: A1 -- T3
 montage = 9, T3-C3: T3 -- C3
 montage = 10, C3-CZ: C3 -- CZ
 montage = 11, CZ-C4: CZ -- C4
 montage = 12, C4-T4: C4 -- T4
 montage = 13, T4-A2: T4 -- A2
 montage = 14, FP1-F3: FP1 -- F3
 montage = 15, F3-C3: F3 -- C3
 montage = 16, C3-P3: C3 -- P3
 montage = 17, P3-O1: P3 -- O1
 montage = 18, FP2-F4: FP2 -- F4
 montage = 19, F4-C4: F4 -- C4
 montage = 20, C4-P4: C4 -- P4
 montage = 21, P4-O2: P4 -- O2
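
Each montage line names a differential channel formed from two electrodes.
As a hedged sketch, the function below parses a montage file and maps each
electrode onto the raw channel index from the channel list above. The
function name naeg_montage_indices is hypothetical, and the sketch assumes
the channel labels in the edf headers match the short names shown in that
list (the actual header labels may differ).

```shell
# Hypothetical helper: print "montage_index label raw_index_1 raw_index_2"
# for each "montage = N, LABEL: E1 -- E2" line in a montage file.
# Usage: naeg_montage_indices DOCS/montages/01_tcp_ar_natus_montage.txt
naeg_montage_indices() {
  sed -n 's/^ *montage *= *\([0-9][0-9]*\), *\([^:]*\): *\([^ ]*\) *-- *\([^ ]*\) *$/\1 \2 \3 \4/p' "$1" |
  awk '
    BEGIN {
      # Raw channel order copied from the channel list in this file (0-based).
      n = split("FP1 F7 T3 A1 T5 O1 F3 C3 P3 FZ CZ PZ FP2 F8 T4 A2 T6 O2 F4 C4 P4", ch, " ")
      for (i = 1; i <= n; i++) idx[ch[i]] = i - 1
    }
    { print $1, $2, idx[$3], idx[$4] }'
}
```

For example, the line "montage = 8, A1-T3: A1 -- T3" yields "8 A1-T3 3 2",
that is, raw channel 3 paired with raw channel 2. Comment lines and the
[Montage] header are skipped because they do not match the pattern.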

To learn more about this, please review this publication:

 Lopez, S., Gross, A., Yang, S., Golmohammadi, M., Obeid, I., &
 Picone, J. (2016). An Analysis of Two Common Reference Points for
 EEGs. In IEEE Signal Processing in Medicine and Biology Symposium
 (pp. 1–4). Philadelphia, Pennsylvania, USA. Available at:
 https://www.isip.piconepress.com/publications/conference_proceedings/2016/ieee_spmb/montages/.

Finally, here are some basic descriptive statistics about the data.
The Linux commands used to generate these numbers are shown below.
For the commands below, the starting point was here:

 /data/isip/data/natus_aeeg/v1.0.0/edf

( 1) Number of 30-minute edf/csv/csv_bi files: 10,875

 nedc_130_[1]: find . -name "*_??_???.edf" | wc -l
 10875
 nedc_130_[1]: find . -name "*_??_???.csv" | wc -l
 10875
 nedc_130_[1]: find . -name "*_??_???.csv_bi" | wc -l
 10875

( 2) Number of 6-hour recordings: 950

 nedc_130_[1]: find . -maxdepth 2 -mindepth 2 -type d | wc -l
 950

( 3) Number of studies: 100

 nedc_130_[1]: find . -maxdepth 1 -mindepth 1 -type d | wc -l
 100

( 4) Number of 30-minute files with seizures: 1,821

 nedc_130_[1]: find . -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f4 | cut -d":" -f1 | sort -u | wc -l
 1821

( 5) Number of 6-hour recording sessions with seizures: 422

 nedc_130_[1]: find . -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2,3 | sort -u | wc -l
 422

( 6) Number of studies with seizures: 65

 nedc_130_[1]: find . -name "*.csv" -exec grep -H "sz," {} \; | cut -d"/" -f2 | sort -u | wc -l
 65

( 7) Total number of seizure events (measured using *.csv_bi): 3,779

 nedc_130_[1]: find . -name "*.csv_bi" -exec grep -H seiz {} \; | wc -l
 3779

( 8) Total duration: 19,494,593 secs (5,415 hours)

 nedc_130_[1]: find . -name "*.csv" -exec grep duration {} \; | awk '{ sum+=$4} END {print sum}'
 19494593

( 9) Total size of the corpus: 397,899 Mbytes (397.9 Gbytes)

 nedc_130_[1]: cd  /data/isip/data/natus_aeeg/v1.0.0/edf
 nedc_130_[1]: du -sBM .
 397899M	.

(10) Total duration of seizure events: 53,018.9 secs (14.7 hours)

 nedc_130_[1]: find . -name "*.csv_bi" -exec grep -H "seiz," {} \; | cut -d"," -f2,3 | sed -e "s/,/ /g" | awk '{ sum +=($2-$1)} END {print sum}'
 53018.9
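
The same approach can be extended to break the seizure duration down by
study and to put the totals in context. The sketch below is illustrative
only: the function name naeg_seiz_secs_per_study is hypothetical, and it
relies on the same assumptions as the command in item (10), namely that
fields 2 and 3 of each "seiz," row are the start and stop times in seconds.

```shell
# Hypothetical helper: total seizure duration (secs) per study directory.
# Usage: naeg_seiz_secs_per_study /path/to/edf
naeg_seiz_secs_per_study() {
  (cd "$1" &&
   find . -name "*.csv_bi" -exec grep -H "seiz," {} \; |
   awk -F, '{ split($1, p, "/")          # p[2] is the study directory
              sum[p[2]] += $3 - $2 }
            END { for (s in sum) printf "%s %.1f\n", s, sum[s] }' |
   sort)
}

# Corpus totals from items (8) and (10), expressed in hours and as a
# percentage of the total recorded time.
awk 'BEGIN { seiz = 53018.9; total = 19494593
             printf "%.1f hours, %.2f%% of the recorded time\n",
                    seiz / 3600, 100 * seiz / total }'
```

The final command prints "14.7 hours, 0.27% of the recorded time", which
shows how sparse seizure events are relative to background.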

-----------------------------

If you have any additional comments or questions about the data,
please direct them to help@nedcdata.org.

Best regards,

Joe Picone