Configurable tool for experiment data formatting and basic statistics

by Radu Luchian(ov)
(Presentation for the Summer School 2000)

Intro, linking the current presentation with the one I made in 1998: placing StaMon in the context of the projected on-line experiment management tool (MonEx).

why not use the existing statistical packages?

see it at work

see its self-presentation at work

Goals of this presentation:

to describe the beta version and projected features

to gauge the interest of researchers in such a tool

to collect suggestions on new format modules for current tasks

StaMon description:
development environment: MacPerl5, portable on Windows, Unix, OS/2, etc.
modular structure: library of routines accessed from a template-like script
list of modules:
- low-level data input set (converters for reading:)
  - Standard PsyScope data: subject-files;
  - Standard PsyScope data: experiment-files;
  - MonEx data: subject-files and experiment-files;
  - token-separated text: data exported from Word, Excel, Statistica, Systat, etc.;
- medium-level modules
  - experiment data tracking;
  - variable values checking/reporting;
  - variable creation and recoding;
  - consistency/correctness verification;
  - configurable table export;
  - basic statistics on a set of variables by a set of conditions
    - avg, range, freq, set, variance, SD, normalization (z-transform);
    - contingency table verification and related confidence intervals;
    - chi-square, t-test, etc.
- high-level automated modules
  - experiment design checker
  - word-based translator, etc.
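
As an illustration of what the basic statistics module covers, here is a minimal sketch of the average, range, variance, SD and z-transform of one variable. It is written in Python rather than the perl that MoStaCon generates, the RT values are made up for the example, and the choice of the sample (n - 1) formulas is an assumption, not a statement about MoStaCon's routines.

```python
import statistics

# Illustrative reaction times in ms (made-up values, not real data)
rts = [1254, 1354, 2251, 1143, 1336, 2152]

mean_rt = statistics.mean(rts)        # avg
rt_range = (min(rts), max(rts))       # range
var_rt = statistics.variance(rts)     # sample variance (n - 1 denominator)
sd_rt = statistics.stdev(rts)         # sample SD, square root of the variance

# Normalization (z-transform): distance from the mean in SD units
z_scores = [(rt - mean_rt) / sd_rt for rt in rts]

print(mean_rt, rt_range, var_rt, sd_rt)
print(z_scores)
```

After the z-transform the values have mean 0 and SD 1, which is what makes scores from different subjects or conditions comparable.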

Projected features:

GUI;

advanced statistics modules;

integration in MonEx;

First some Legalese, for good measure:

MoStaCon is an original work based on the perl filters I wrote for the Cognitive Science Department at the New Bulgarian University between 1998 and 2000. It was written by Radu Luchian(ov) (radu@monicsoft.net) and is copyright (c) 1998-2000. All Rights Reserved.

MoStaCon is provided as "freeware." There is no fee attached to using it. However, it is NOT public domain software. You must abide by the terms of this license when you use it.

You may use, reproduce, and distribute MoStaCon freely, so long as you do not modify MoStaCon and as long as you give me (its author) due credit in any publication in which you use data formatted with it or results achieved from the included analyses. If you think you can add a useful feature to MoStaCon or you would like some modification made to any part of it, please send the author a message with the subject "MoStaCon feature request".

MoStaCon is available as is. In no event will the author be liable for damages to research data, backups or for loss of other resources due to the uninformed handling or mishandling of MoStaCon. Users are advised to use copies of their data with the MoStaCon filters.

That being said, the product is stable and I'm doing my best to check every module as I add it and to predict and circumvent any hazardous input or handling. I hope you enjoy using it.

Now for the discussion:

Some years ago, a team at Carnegie Mellon University released as freeware an excellent tool for online-experiment creation, a full-fledged multimedia authoring system: PsyScope. Unfortunately, the data analysis tool that came with it was/is too limited, and extending it involved C-source compilation, with all the flexibility and tedium that implies. Since I planned a Web-based experiment-building tool (MonEx) for running my own experiments, I extended the alpha version of the data formatting and statistics module to handle PsyScope files. And I got MoStaCon. There is no exact script that answers to that name; rather, MoStaCon is a system that generates a family of scripts written in perl, which can subsequently be used to reformat your data and run analyses on it that would otherwise take days or months of manual or semi-manual work. For those who see the word for the first time, perl is a document-processing programming language extensively used in Web automation. It is implemented on many platforms (DOS, Win95/NT, Mac, UNIX, etc.). The scripts generated by MoStaCon have a slight Mac bias, but they are usable on any other platform. (I find the combination MacPerl-BBEdit-[Netscape|Shuck] the best IDE around.)

What follows is the 'contextual help' for the form that helps you generate a MoStaCon script. The idea is the following: MoStaCon's modules are programmed into the HTML form, which helps a perl programmer write the data formatting/analysis script; actually, the form is so detailed that the researcher using it needs very little knowledge of perl. The necessary knowledge is a twenty-minute lecture on variable usage and file formats (presented below), plus an understanding of experiment design and of the specific hypothesis or hypotheses s/he wants to test based on the data coming out of an experiment.

For a quick start, please check the already filled-in examples:

Gender Agreement

Lexical Access - Verbs and Adjectives

Gender-Sex Interaction

SI Animacy
You're welcome to play with the examples: modify things, add things, then click the buttons to see how the changes affect the different blocks of the script.
... but it's always better to learn something hands-on and step by step, so read along.

Screen resolution

The higher the screen resolution you use when working with the form, the better (1024x768 would be nice). You can reduce the number of colors to 16 if that allows you more pixels. The reason is that you need to see as many fields of this form as possible, so you can better spot spelling errors. It also helps by giving you a larger picture of the steps you're following. As an alternative, you may want to print the form, fill in the details on paper, then enter them from there.

Note on spelling errors:
Please try to avoid them, since they'll result in endless syntax errors once you try running the script; what's worse, if you misspell a variable name, the script may run perfectly but you may get errors in the output files!!

Summary variables

MoStaCon helps you with design too: to start with, you have to supply a short name for the experiment you're starting to analyse. It's good to recap some of the stuff you'll need for the analysis. These four things are: (a) the name of the experiment :), (b) the type of files you are going to get your information from, (c) the type of analyses and output files you want to get, and (d) any additional notes you may have about the design, coding or whatever data filtering you may apply. I'll take an example, and I'll use it to illustrate all the further steps:

the experiment is called (for short) Gender Agreement;

we need to import some subject data from a file, and the subject answers are in separate files;

we want a table with all the data points, a table with mean RTs for each stimulus pair and a table with normalized choice data;

we want to filter out all RTs larger than 3000ms;
You'll have to specify something like:

Experiment name:	GeAg
Input files:	- a file with subject data; - a file with the names of the PsyScope subject-answer files; - 48 individual subject-answer files from PsyScope
Output files:	- a tab-separated table with all the data points; - a .csv file with mean RTs by stimulus pair; - a space-separated table with normalized choice data for each condition
Additional notes:	- in the full data table and subsequent analysis, only RTs of 3000ms or less are included (larger RTs are filtered out).

This summary will be useful later on in the analysis, to make sure you got all the routines you need in the final perl script.
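
To make the example concrete, here is a minimal sketch of two of the routines the summary calls for: dropping RTs larger than 3000ms and computing the mean RT for each stimulus pair. It is in Python rather than the perl that MoStaCon generates; the pair labels, the RT values and all the names (`RT_CUTOFF_MS`, `trials`, `mean_rt_by_pair`) are invented for illustration.

```python
from collections import defaultdict

RT_CUTOFF_MS = 3000  # RTs larger than this are filtered out

# (stimulus pair, RT in ms): invented data points
trials = [
    ("pair1", 1200),
    ("pair1", 3400),  # too slow, will be dropped
    ("pair1", 1400),
    ("pair2", 900),
    ("pair2", 3000),  # exactly at the cutoff, kept
]

# Keep only the data points at or below the cutoff
kept = [(pair, rt) for pair, rt in trials if rt <= RT_CUTOFF_MS]

# Group the remaining RTs by stimulus pair, then average each group
rts_by_pair = defaultdict(list)
for pair, rt in kept:
    rts_by_pair[pair].append(rt)

mean_rt_by_pair = {pair: sum(rts) / len(rts) for pair, rts in rts_by_pair.items()}
print(mean_rt_by_pair)
```

Note that the filter is applied before the means are computed, so a single very slow answer cannot distort the mean of its stimulus pair.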

Additional variables

These are things like file names and other little things you may need while coding.
Note: here you should put only "constant" variables, those which will stay the same for the whole conversion script.
A special type of variable are the hashes:

Replacement variables

Also called 'associative arrays' or simply 'hashes', this kind of variable helps you specify substitutions.
For example, if you used the letters "c" and "s" to code all situations in the experiment in which subjects faced complex and simple tasks, respectively, you would want to substitute "c" with "complex" and "s" with "simple"; for this, you would prepare a hash (%subst1) with the values ("c","complex","s","simple",). So: write the name of the hash in its input field, then for each code, repeat:

write the code in the code input field,

write the replacement value in the replacement input field,

press the "adds up" button.
After you have finished with all the codes, press the "Add" button.
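
What such a replacement variable does to the data can be sketched in a few lines. The sketch is in Python, where a dict plays the role of the perl hash %subst1; the list of codes is invented, and leaving unknown codes unchanged is an assumption made for the example, not necessarily what the generated script does.

```python
# Python dict standing in for the perl hash %subst1
subst1 = {"c": "complex", "s": "simple"}

# Invented column of task codes as they might appear in a data file
codes = ["c", "s", "s", "c"]

# Replace each code with its long form; codes missing from the
# hash are passed through unchanged (an assumption for this sketch)
recoded = [subst1.get(code, code) for code in codes]
print(recoded)
```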

It's not obligatory to declare the variables at the beginning; you may figure out while formatting or recoding some data that you need a new variable, so you'll simply scroll up and add it. But enough on variables.

Let's see how we tell the script to gobble our data.

Data-input blocks

A data-input block is any amount of data which can be described together.
These come in several flavors:

	plain	with header
experiment-file
subject-files

For multiple files there is also a third dimension: whether there is a need to create one file per experiment (many-to-one) or one file per subject (many-to-many).

[...]

Token-separated text

This type of file is the oldest digital data-interchange format, used in text editors, spreadsheet editors, statistical packages and so on. Whatever software you're using, you can be sure there is a way to get your data out of it in a specific-token-separated file; the basic idea is that you have some data in columns, and you replace the column boundaries with a specific token (space, comma, tab, any character you want, or even specific combinations of characters). The columns can be named, in which case the first row is dedicated to the column names; this is not obligatory, though. Here's an example. The data in a spreadsheet like:

Subject	Age	Sex	RT
1	22	M	1254
1	22	M	1354
1	22	M	2251
2	20	F	1143
2	20	F	1336
2	20	F	2152

would be saved in a space-separated file this way:

Subject Age Sex RT
1 22 M 1254
1 22 M 1354
1 22 M 2251
2 20 F 1143
2 20 F 1336
2 20 F 2152

This type of format is very simple and intuitive, but there is a catch: whatever you choose your separation token to be, you have to make sure that it doesn't appear in any of the columns. For example, you can't use space-separated files if the Subject column contains subject names, normally separated by spaces. This is also true for record-separators: in simple token-separated files, the record-separator is obviously the line break (aka Enter, aka CR, aka CR/LF etc.) It follows that none of the cells in such a file can contain a line break.
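
The catch is easy to see in a few lines of code. The sketch below (in Python, with a made-up row containing a subject name) splits the same record once on spaces and once on tabs:

```python
# A made-up record whose first column holds a subject name
space_row = "John Smith 22 M 1254"
tab_row = "John Smith\t22\tM\t1254"

# Splitting on spaces breaks the name into two columns: 5 fields instead of 4
fields_space = space_row.split(" ")

# With tabs as the separation token the name stays in one piece: 4 fields
fields_tab = tab_row.split("\t")

print(len(fields_space), fields_space)
print(len(fields_tab), fields_tab)
```

With the space-separated version every value after the name lands under the wrong column name, and nothing in the file itself signals the error.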

There are several very frequently used token-separated formats:

space-separated files: the simplest and earliest; the problem is that the data in a column usually isn't all of the same size, and as a result you can't follow the columns by eye (in the example above it's not immediately obvious which column goes with which column name); use this format when the column data HAS the same size (in characters);

.csv files: another one-character separator, frequently used with columns of numbers; .csv stands for comma-separated values; the idea is that each column is a variable (the name of the variable is in the first row/record); use this format when the column data HAS the same size (in characters);

solution: tab-separated files: some text editors allow you to see your data better by extending the tab size, or they even convert tab-separated chunks into tables (see Word);

MoStaCon treats all the token-separated files the same way: it ignores lines starting with # or space or tab, considering them as comment lines; the first non-comment line found is treated as a list of variable names, and these variables get the list of values listed under them. Using the example above, MoStaCon would come up with four variables (lists):
Subject={1,1,1,2,2,2}
Age={22,22,22,20,20,20}
Sex={M,M,M,F,F,F}
RT={1254,1354,2251,1143,1336,2152}
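
The reading rule described above can be sketched as follows. This is a Python illustration, not the perl code MoStaCon generates: the function name `read_token_separated` and its `sep` parameter are hypothetical, and skipping empty lines is an extra assumption.

```python
def read_token_separated(text, sep=None):
    """Parse token-separated text into {column name: list of values}.

    sep=None splits on runs of whitespace; pass "\t" or "," for
    tab-separated or comma-separated data.
    """
    names = None
    columns = {}
    for line in text.splitlines():
        # Comment lines start with '#', space or tab; empty lines are
        # skipped as well (an assumption made for this sketch)
        if not line.strip() or line[0] in "# \t":
            continue
        fields = line.split(sep)
        if names is None:
            # The first non-comment line is the list of variable names
            names = fields
            columns = {name: [] for name in names}
            continue
        # Every later line adds one value to each variable's list
        for name, value in zip(names, fields):
            columns[name].append(value)
    return columns


sample = """# sample data from the example above
Subject Age Sex RT
1 22 M 1254
1 22 M 1354
1 22 M 2251
2 20 F 1143
2 20 F 1336
2 20 F 2152
"""

data = read_token_separated(sample)
print(data["RT"])
```

The values are kept as text here; a real script would still have to convert numeric columns such as RT before doing any statistics on them.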

PsyScope files: are a particular case of tab-separated files. The difference is that, by default, PsyScope also saves a header in which the PsyScope script designer records information pertaining to each run of the experiment. You have to tell MoStaCon which of

MonEx files: are similar to PsyScope files in the usage of headers, but their headers are more standardized (the experimenter has a choice of already-defined run-wide variables to choose from, like SubjectName, SubjectAge, etc.).

Data output

The purpose of this whole program is to output the data in a format that can be used in further analyses.
...

Data coding and conversion

...

Script installation and handling

For MacOS:

After the script is generated in the last text area of the MoStaCon form, click anywhere in that area and from the menu 'Edit' choose 'Select All' (or press Cmd-A), then Copy (Cmd-C), open a new text file in MacPerl's 'File' menu, Paste (Cmd-V), then check the syntax (Cmd-Shft-R), then save the script in the folder where you keep the data file(s), and run the script (Cmd-R).

For Windoze:

After the script is generated in the last text area of the MoStaCon form, click anywhere in that area and from the menu 'Edit' choose 'Select All' (or right-click and 'Select All'), then Copy (again from the 'Edit' menu or the contextual menu), open a new text file in a text editor (NoteTab, if available; if not, Notepad will do), Paste, then save the script in the folder where you keep the data file(s), and run it.

If you use any other operating system, I bet you don't need any other advice than: put the script where the data file(s) are and run it.

Modifications to the system

Once you learn to use MoStaCon, you may want to add modules of your own making, containing specific formatting you use. There are basically two ways to achieve this:

you may send me a message describing the formatting you need; it may happen that you can do it with what's already implemented, and I may explain how;

if you have a programmer in the research team who knows basic JavaScript and perl, s/he can use the MoStaCon Tools (the script is self-documented with comments).
