Handling Data

11: Handling Data

Gri can handle many different sorts of data file formats, including ASCII files, binary files in machine format, and the very powerful and increasingly popular netCDF format. (For more information on netCDF format, see `http://www.unidata.ucar.edu/packages/netcdf/index.html'.) This chapter concentrates on ASCII format.

The overall message is that you should not have to modify your data files to work with Gri. For example, many oceanographic data files have header lines at the start. With other plotting systems, users find themselves stripping off these headers as a first step in data analysis. This is done to make the data look like a tabular list, or matrix, for reading by Matlab or various spreadsheet-like programs. (It is not necessary to do this in Matlab, by the way; you should use the Matlab `fgets' command instead, to read and skip the header lines. However, it is almost always necessary to do this in spreadsheet-like programs, especially the GUI-based ones, because the paradigm is often to click on columns of the data that represent variables of interest.)

The difficulty with stripping off header lines is that you can lose the header information unless you are careful to put it in a separate file with an appropriate filename, and then just as careful to archive the header along with the data, to send both to a colleague who has requested the data, and so on. Often the header information seems unimportant to you at the moment, but it may be crucial to you later on, or to the next person who looks at the data! In Gri it is very easy to handle headers within files. It is also easy to handle data that are in somewhat odd formats, or that must be manipulated mathematically or textually to make sense.

11.1: Handling headers

11.1.1: Case 1 -- known number of header lines

This is easy. If you know that the file has, say, 10 header lines, you can just do this:
open file
skip 10
read columns x y
...
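
If you would rather do the skipping outside Gri, the same effect can be sketched with a standard Unix tool. This is only an illustration: the function name `skip_header_lines' and the sample lines are made up here. The idea is that `tail -n +K' prints its input starting at line K, so skipping N header lines means starting at line N+1.

```shell
# Hypothetical helper (not part of Gri): drop the first N lines of
# standard input. tail -n +K starts printing at line K, so K = N + 1.
skip_header_lines() {
    n=$1
    tail -n "+$((n + 1))"
}

# Made-up sample: 2 header lines followed by two data rows.
# Only the rows "1 10" and "2 20" are printed.
printf 'station A\ncruise 7\n1 10\n2 20\n' | skip_header_lines 2
```

For most purposes the Gri `skip' command shown above is simpler; the shell form is mainly useful when the data are already being passed through other system tools.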

11.1.2: Case 2 -- header itself indicates number of header lines

Quite often the first line of a file will indicate the number of header lines. For example, suppose the first line contains a single number, indicating the number of header lines to follow:
open file
read .skip.
skip .skip.
read columns x y
...
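
The same logic can be sketched with Awk, which may help if you want to check what will be left once the header is gone. The function name `skip_counted_header' and the sample lines are assumptions made for illustration; the sketch assumes, as above, that the first line holds a single number giving the count of header lines that follow it.

```shell
# Hypothetical helper (not part of Gri): the first line of standard input
# holds the number of header lines that follow it; print only the data.
skip_counted_header() {
    awk 'NR == 1 { n = $1; next }   # remember the count, drop line 1
         NR > n + 1                 # print everything after the n header lines'
}

# Made-up sample: the count (2), two header lines, then two data rows.
# Only the rows "1 10" and "2 20" are printed.
printf '2\nstation A\ncruise 7\n1 10\n2 20\n' | skip_counted_header
```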

11.1.3: Case 3 -- header lines marked by a textual key

Sometimes header lines are indicated by a textual key, for example, the characters `HEADER' at the start of the line in the file. The easy way to skip such a header is to use a system command. Depending on your familiarity with the operating system (here presumed to be Unix), you might choose to use Grep, Awk, or Perl. Here is an example using Grep:
open "grep -v '^HEADER' file |"
For more on the `|' mechanism, see Open. The Grep command prints lines which do not match the indicated string (because of the `-v' switch), and the `^' character stands for the start of the line (see Grep). Thus all lines with the key word at the start of the line are skipped.
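
To see the filter at work outside Gri, and to compare it with an Awk version, here is a small sketch. The function names and the sample lines are made up for illustration. Note that only lines that start with the key are dropped; a line that merely contains `HEADER' further along is kept.

```shell
# Two hypothetical helpers that drop lines beginning with the key HEADER.
drop_header_grep() { grep -v '^HEADER'; }   # -v inverts the match
drop_header_awk()  { awk '!/^HEADER/'; }    # print lines that do not match

# Made-up sample: header lines interleaved with data rows.
# Both commands print the rows "1 10" and "2 20".
printf 'HEADER station A\n1 10\nHEADER cruise 7\n2 20\n' | drop_header_grep
printf 'HEADER station A\n1 10\nHEADER cruise 7\n2 20\n' | drop_header_awk
```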

11.1.4: Case 4 -- reading and using information in header

Consider a dataset in which the first line gives the time of observation, followed by a list of observations. This might be, for example, data taken from a weather balloon released at a particular time from a fixed location, with the main data being air temperature as a function of elevation of the balloon. The time indication might be, for instance, the hour number. One might need to know the time to print a label on the diagram. You could do that by:
open file
read .time.
read columns x y
draw curve
sprintf \label "Time of observation is %f hour" .time.
draw title "\label"
where the `sprintf' command has been used to change the numerical time indication into a synonym that can be inserted into a quoted string for drawing the title of the diagram (see Sprintf). Here the time has been assumed to be a decimal hour. You might also have three numbers on the line, perhaps a day, an hour and a minute. Then you could do something like
open file
read .d. .h. .m.
read columns x y
draw curve
sprintf \label "Obs. %.0f:%.0f, day %.0f" .h. .m. .d.
draw title "\label"
Here the `%.0f' code is used to ensure no numbers will be written after the decimal point. Naturally, you could convert this to a decimal day, by e.g.
...
.dday. = {rpn .d. .h. 24 / .m. 24 / 60 / + +}
sprintf \label "Decimal day is %.4f" .dday.
...
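
As a check on the arithmetic, the decimal day is the day number plus the hour divided by 24 plus the minute divided by 24*60. A small sketch using Awk as a calculator follows; the function name `decimal_day' and the sample values are made up for illustration.

```shell
# Hypothetical helper (not part of Gri): convert day, hour and minute
# to a decimal day, printed with four digits after the decimal point.
decimal_day() {
    awk -v d="$1" -v h="$2" -v m="$3" \
        'BEGIN { printf "%.4f\n", d + h / 24 + m / (24 * 60) }'
}

decimal_day 12 6 0    # day 12 at 06:00, prints 12.2500
```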
(Some of you might know how many minutes are in a day, but I'm silly so I kept the extra mathematical step -- nothing is lost by being straightforward!) Data sets often have information on observation location as well as time, and I hope the above suggests how to handle that. It is very common, by the way, to want to draw so-called "waterfall" plots, in which many curves are drawn, each one offset by some amount. Sometimes you'll want the offset to be constant, but just as often you'll want the offset to be dependent on information in a header, such as the time of observation. I hope the above indicates roughly how you might handle this, but a snippet is:
open file1
read .time.
read columns x y
y += {rpn .time. 100 /}
draw curve

open file2
read .time.
read columns x y
y += {rpn .time. 100 /}
draw curve
...
Here the time has been divided by a scale factor of 100, to convert time units into an offset in terms of y.

11.2: Ignoring columns that are not of interest

Quite often a dataset will have many columns, of which only a couple are of interest to you. Consider for example an oceanographic dataset which has columns storing, in order, these variables: (1) depth in water column, (2) "in situ" temperature, (3) "potential" temperature, (4) salinity, (5) conductivity, (6) density, (7) sigma-theta, (8) sound speed, and (9) oxygen concentration. But you might only be interested in plotting a graph of salinity on the x-axis and depth on the y-axis. Here are several ways to do this:
open file
read columns y * * x
draw curve
where the `*' is a place-keeper to indicate to skip that column. For a large number of columns, or as an aesthetic choice, you might prefer to write this as
open file
read columns y=1 x=4
draw curve
Many users would just as soon not bother with this syntax, preferring instead to use system tools with which they are more familiar. So a Gawk user might write
open "gawk '{print($1, $4)}' file |"
read columns y x
draw curve
For more on the Gawk command, see Awk.
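
To see what such an Awk filter hands on to Gri, it can be run on its own. The sketch below is an illustration only: the function name `depth_and_salinity' and the data values are made up, and plain `awk' is used in place of `gawk' on the assumption that either is available. Note that the Awk action must be enclosed in braces.

```shell
# Hypothetical helper (not part of Gri): print columns 1 and 4, which are
# depth and salinity in the column layout described above.
depth_and_salinity() {
    awk '{ print $1, $4 }'
}

# Made-up sample with five columns per row.
# Prints "10 34.5" and "20 34.7".
printf '10 4.1 4.0 34.5 3.2\n20 3.9 3.8 34.7 3.1\n' | depth_and_salinity
```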

11.3: Manipulating columns

Suppose the file contains (x,y), but you wish to plot 2y versus x. You could do the doubling of y within Gri, as
open file
read columns x y
y *= 2
draw curve
or you could use a system tool, e.g. Gawk, as in this example (see Awk).
open "gawk '{print($1, 2*$2)}' file |"
read columns x y
draw curve
The latter is preferable in the sense that it is more powerful. The reason for this is that Gri allows you to manipulate the x and y columns, using so-called RPN mathematics (see rpn Mathematics), but you cannot blend the columns. For example, you cannot easily form the ratio y/x in Gri. (Actually, you can, by looping through your data and doing the calculation index by index, but if you knew that already you wouldn't need to be reading this section!) Such blending is trivial in the operating system, though, as in the following Gawk example (see Awk).
open "gawk '{print($1, $2/$1)}' file |"
read columns x y
draw curve

11.4: Combining columns from different files

Suppose you want to plot a column (`y', say) from one file versus a second column (`x') from a second data file. The easy way is to use a system command to create a new file, for example the Unix command `paste' -- but of course you don't want to clutter your filesystem with such files, so you should do this within Gri:
open "paste file1 file2 |"
read columns x y
draw curve
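
The example above works directly when each file holds a single column. If the files have several columns, `paste' puts all the columns of the first file before those of the second, so a further selection step is needed. The sketch below is an illustration under stated assumptions: the function name `paste_x_y', the temporary file names, and the data are made up, and each file is assumed to have exactly two columns.

```shell
# Hypothetical helper (not part of Gri): take x from column 1 of the first
# file and y from column 2 of the second file. With two columns per file,
# the second file's column 2 is column 4 of the pasted output.
paste_x_y() {
    paste "$1" "$2" | awk '{ print $1, $4 }'
}

# Made-up sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf '1 100\n2 200\n' > "$tmpdir/file1"
printf '7 0.5\n8 0.6\n' > "$tmpdir/file2"
paste_x_y "$tmpdir/file1" "$tmpdir/file2"    # prints "1 0.5" and "2 0.6"
rm -r "$tmpdir"
```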