Exploring grouped data with gramm
In this example file, we will go further in exploring gramm's capabilities for data where the independent variables are categorical / group data.
To benefit from the interactive elements, open it in MATLAB's editor.
We will load a partial dataset from a human movement science experiment:
websave('example_movement','https://github.com/piermorel/gramm/raw/master/sample_data/example_movement.mat'); %Download data from repository
load example_movement.mat
T
T = 3170×15 table
| | subject | session | trial_index | reference_direction | hit | m_movement_duration | m_dist | m_reaction_time | (name missing) 1 | (name missing) 2 | (name missing) 3 | valid_perc | valid_perc_session | px | py | t | tperc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IHTA | 1 | 2 | 105 | 0 | 3.0020e+03 | 201.1152 | 616.7897 | -20.7055 | 37.2741 | 0 | 0.1634 | 0.1634 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| 2 | IHTA | 1 | 3 | 60 | 1 | 2.6261e+03 | 487.0262 | 404.2587 | 40 | 29.2820 | 0 | 0.3268 | 0.3268 | 1×317 double | 1×317 double | 1×317 double | 1×317 double |
| 3 | IHTA | 1 | 4 | 330 | 0 | 3.0016e+03 | 483.2827 | 341.6924 | 69.2820 | -80 | 0 | 0.4902 | 0.4902 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| 4 | IHTA | 1 | 7 | 240 | 1 | 1.8670e+03 | 323.5568 | 303.2130 | -40 | -109.2820 | 0 | 0.8170 | 0.8170 | 1×226 double | 1×226 double | 1×226 double | 1×226 double |
| 5 | IHTA | 1 | 9 | 15 | 1 | 2.8925e+03 | 638.7513 | 283.2674 | 77.2741 | -19.2945 | 0 | 1.1438 | 1.1438 | 1×349 double | 1×349 double | 1×349 double | 1×349 double |
| 6 | IHTA | 1 | 14 | 150 | 0 | 3.0024e+03 | 632.5109 | 306.7775 | -69.2820 | -0 | 0 | 1.7974 | 1.7974 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| 7 | IHTA | 1 | 16 | 60 | 1 | 1.0567e+03 | 96.0046 | 294.8469 | 40 | 29.2820 | 0 | 1.9608 | 1.9608 | 1×129 double | 1×129 double | 1×129 double | 1×129 double |
| 8 | IHTA | 1 | 18 | 240 | 1 | 1.3083e+03 | 293.7532 | 320.3605 | -40 | -109.2820 | 0 | 2.2876 | 2.2876 | 1×159 double | 1×159 double | 1×159 double | 1×159 double |
| 9 | IHTA | 1 | 19 | 150 | 0 | 3.0020e+03 | 556.4195 | 367.6910 | -69.2820 | -0 | 0 | 2.4510 | 2.4510 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| 10 | IHTA | 1 | 27 | 195 | 1 | 2.3428e+03 | 468.3147 | 309.4387 | -77.2741 | -60.7055 | 0 | 2.9412 | 2.9412 | 1×283 double | 1×283 double | 1×283 double | 1×283 double |
| 11 | IHTA | 1 | 29 | 15 | 1 | 1.3741e+03 | 236.0695 | 332.9378 | 77.2741 | -19.2945 | 0 | 3.1046 | 3.1046 | 1×167 double | 1×167 double | 1×167 double | 1×167 double |
| 12 | IHTA | 1 | 34 | 330 | 0 | 2.4416e+03 | 346.7274 | 369.4775 | 69.2820 | -80 | 0 | 3.4314 | 3.4314 | 1×295 double | 1×295 double | 1×295 double | 1×295 double |
| 13 | IHTA | 1 | 37 | 105 | 0 | 3.0028e+03 | 893.2468 | 337.2823 | -20.7055 | 37.2741 | 0 | 3.7582 | 3.7582 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| 14 | IHTA | 1 | 40 | 195 | 0 | 3.0028e+03 | 782.8285 | 423.8782 | -77.2741 | -60.7055 | 0 | 3.9216 | 3.9216 | 1×362 double | 1×362 double | 1×362 double | 1×362 double |
| ⋮ | | | | | | | | | | | | | | | | | |
In this dataset, we have four different subjects (subject), each coming to the lab for two sessions (session) on consecutive days. During each session, they learn to control the displacement of a cursor on a screen, and their task is to reach targets with the cursor. The targets are arranged at discrete angles (reference_direction) in a circle around a starting point. The cursor is difficult to control, and as markers of progress in the task, we record whether they reach the target in time (hit) and how long their reaction time is (m_reaction_time). Each row corresponds to a trial (trial_index), and we transformed the index into a percentage of trials performed within a session (valid_perc, from 0 to 100% in each session) or across sessions (valid_perc_session, from 0 to 200% across both sessions).
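As an illustration of the index transformation described above, here is a minimal sketch of how a trial index could be turned into a within-session and an across-session percentage. The table toy, its values, and the column names perc_within and perc_across are hypothetical; this is not necessarily how valid_perc and valid_perc_session were computed for the real dataset.

```matlab
% Toy table with two sessions of four trials each (hypothetical values)
toy = table([1;1;1;1;2;2;2;2], [2;3;4;7;1;5;6;9], ...
    'VariableNames', {'session','trial_index'});

% Rank each trial within its session and express the rank as a percentage
toy.perc_within = zeros(height(toy),1);
for s = unique(toy.session)'
    idx = toy.session == s;                  % rows of the current session
    n = nnz(idx);                            % number of trials in this session
    [~, order] = sort(toy.trial_index(idx)); % chronological order
    r = zeros(n,1);
    r(order) = 1:n;                          % rank of each trial
    toy.perc_within(idx) = 100 * r / n;      % reaches 100% at the last trial
end

% Across sessions: add 100% for every session already completed
toy.perc_across = toy.perc_within + 100 * (toy.session - 1);
disp(toy)
```

With two sessions, perc_across ends at 200%, which matches the range described for valid_perc_session.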
Using categorical data on the x axis
First, let's examine the progress (using the reaction time m_reaction_time) between sessions for each subject. With gramm, it's possible to use categorical data on the x axis and thus reproduce typical raw data plots or statistical plots that would accompany analyses such as ANOVAs.
Interactive parameter: To prevent points from the two sessions from overlapping, we use the 'dodge' parameter in geom_point(). Its numerical value sets the spacing along the x axis used to avoid the overlap.
g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
g.geom_point('dodge',0.3);
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % Nothing is plotted until draw() is called
Here we see that even with the 'dodge' argument, the basic geom_point() is limited because of the overlap between points. Two other geom_ methods can improve on this.
Improving the visualization of raw datapoints
Randomly jitter points with geom_jitter() and dodge graphical elements
A first option is to use geom_jitter() instead of geom_point() so that the datapoints are jittered along the x axis.
Interactive parameters:
- The 'width' parameter sets the width along the x axis used for the visualization (here the width of jittering). Setting it below the value used for 'dodge' leaves a small spacing between the points for sessions 1 and 2. Setting a larger value could make the points overlap. Most geom_ and stat_ methods provide usable default values for these parameters, which we rely on later, but they often require tweaking depending on the complexity of your data and the figure size.
- We can also make the individual points transparent with the 'alpha' parameter.
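The interplay between 'dodge' and 'width' can be made concrete with a simplified model. The sketch below assumes that each of the n_groups colors gets a slot of width dodge/n_groups centered on the category position, and that points are jittered uniformly over width/n_groups around the slot center. This model is an assumption chosen to be consistent with the description above, not gramm's actual implementation, and all variable names are hypothetical.

```matlab
% Simplified model of dodging and jittering (illustration only,
% not gramm's actual implementation)
n_groups = 2;      % e.g. two sessions shown in different colors
dodge = 0.6;
width = 0.5;

k = 1:n_groups;
centers = dodge * ((k - 0.5) / n_groups - 0.5); % offsets from the category position
half_width = width / (2 * n_groups);            % half extent of each jittered cloud

left_edges = centers - half_width;
right_edges = centers + half_width;

% Positive gap: clouds are separated; negative gap: clouds overlap
gap = left_edges(2:end) - right_edges(1:end-1);

% Example x positions for five points of the first group of category 1
x_first_group = 1 + centers(1) + (rand(5,1) - 0.5) * 2 * half_width;

fprintf('Gap between neighbouring groups: %.3f\n', gap);
```

In this model the gap equals (dodge - width)/n_groups, so it is positive whenever 'width' is smaller than 'dodge' and becomes negative (overlap) otherwise.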
figure('Position',[100 100 800 500])
g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
g.geom_jitter('dodge',0.6,'width',0.5,'alpha',0.3);
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % Nothing is plotted until draw() is called
This shows the underlying distribution better, but the mass of overlapping points in the middle of the distribution still makes it hard to judge.
Create a beeswarm plot with geom_swarm()
Another option to display raw datapoints is to use a swarm plot, which stacks datapoints horizontally.
Interactive parameters:
- Here we set the 'point_size' parameter for geom_swarm() given the large number of points in our dataset. geom_swarm() is designed so that points within a group never overlap, so big points would make each swarm wide.
- The 'type' parameter configures the way the swarm is constructed.
- The 'corral' parameter configures what happens to points that are placed further to the left and right than the width.
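To illustrate the idea of stacking points horizontally so that they never overlap, here is a minimal greedy sketch. It is not gramm's algorithm and does not correspond to any particular 'type' option; the data values, the diameter d, and the step size d/4 are hypothetical, and the point diameter is assumed to be expressed in the same units on both axes.

```matlab
% Hypothetical y values for one group, sorted so points are placed in order
y = sort([1.0 1.1 1.2 3.0 3.05 1.15]);
d = 0.5;                 % point diameter, same units on both axes (assumption)
x = zeros(size(y));      % horizontal offsets from the category position

for i = 2:numel(y)
    step = 0;
    cand = 0;            % try the center first
    % Move outwards, alternating left and right, until the candidate
    % position is at least one diameter away from every placed point
    while any(hypot(x(1:i-1) - cand, y(1:i-1) - y(i)) < d)
        step = step + 1;
        cand = (-1)^step * ceil(step/2) * d/4;
    end
    x(i) = cand;
end

disp([x; y])
```

Because each point is pushed sideways until it clears all previously placed points, larger points produce wider swarms, which is why a small 'point_size' helps with large datasets.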
figure('Position',[100 100 800 500])
g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
g.geom_swarm('alpha',0.5,'point_size',1.5,'type','up','corral','none');
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % Nothing is plotted until draw() is called
Adding statistics layers
Now that we have displayed our raw dataset, we can add more statistics-oriented visualizations to our graphs. Note that all stat_ layers can be combined with geom_ layers or with each other.
Compare distributions vertically or horizontally
gramm provides two common statistical visualizations for comparing the distributions of grouped data: box and whisker plots or violin plots. You can pick one below with the first dropdown menu.
Interactive parameters:
- The coord_flip() button runs the corresponding method, which flips the x and y axes and thus produces horizontal visualizations.
- Box and whisker plots can have a 'notch' or not.
- When comparing only two groups, violin plots can be set to show only half violins with 'half'.
- The 'fill' option lets you pick between different styles.
- The 'normalization' option sets how each violin's width is normalized, so that groups with different sizes can be compared.
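g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
% Keep one of the two layers below (box plot or violin plot), or both to overlay them
g.stat_boxplot('notch',false);
g.stat_violin('width',0.5,'half',false,'fill',"transparent",'normalization','area');
% g.coord_flip(); % Uncomment for a horizontal visualization
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % Nothing is plotted until draw() is called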
Summarize data
The stat_summary() layer can represent different descriptive statistics with various types of graphical elements: bars, points, error bars, lines, shaded areas, etc. This layer is closest to the output of a statistical test such as an ANOVA or t-test. By default, it represents the mean and the 95% confidence interval of the mean for each group. Note that the 95% confidence interval is computed independently for each group, as gramm can't know your experimental design (there is no multiple comparison correction).
Interactive parameters:
- The 'geom' parameter specifies how the descriptive statistics are represented. The parameter can be given as a single string or as a cell array of strings to combine several representations. Here you can try a combination of two.
- The 'setylim' parameter determines whether the y scale depends on the summary only or encompasses the whole dataset.
- The 'type' parameter specifies which descriptive statistics are used and how they are computed. The default 95% confidence interval assumes a normal distribution, but other distributions or a bootstrapped confidence interval can be picked.
- stat_summary() can be used for continuous or time series data. These uses are detailed in the corresponding live scripts.
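To make the per-group summary concrete, the sketch below computes a mean and a 95% confidence interval independently for each group of a toy table. The table toy and its values are hypothetical, and the interval uses the large-sample normal quantile 1.96 to stay within base MATLAB; gramm's own computation may differ (for small groups, a Student t quantile would give a wider interval).

```matlab
% Toy data (hypothetical values, not taken from the experiment)
toy = table( ...
    ["A";"A";"A";"A";"A";"B";"B";"B";"B";"B"], ...
    [1;1;1;1;1;1;1;1;1;1], ...
    [1;2;3;4;5;2;4;6;8;10], ...
    'VariableNames', {'subject','session','rt'});

% One group per subject/session combination
[G, subject, session] = findgroups(toy.subject, toy.session);

n = splitapply(@numel, toy.rt, G);   % group sizes
m = splitapply(@mean, toy.rt, G);    % group means
s = splitapply(@std, toy.rt, G);     % group standard deviations

z = 1.96;                            % normal quantile for a two-sided 95% interval
half = z * s ./ sqrt(n);             % half-width of each interval

summary_table = table(subject, session, n, m, m - half, m + half, ...
    'VariableNames', {'subject','session','n','mean','ci_low','ci_high'});
disp(summary_table)
```

Each interval only uses the data of its own group, which mirrors the note above: no information about the experimental design and no multiple comparison correction enter the computation.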
g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
g.stat_summary('geom',{'bar','black_errorbar'},'setylim',true,'type','ci');
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % Nothing is plotted until draw() is called
Overall this figure confirms the large between-subject variability in the reaction time and shows that all subjects have a lower reaction time on the second day.
Advanced example
In this last figure, we will overlay the group median on top of the swarm plot using stat_summary(). After the drawing is done, we will access the handles of the graphical elements through the results structure within the gramm object to make the medians more visible.
figure('Position',[100 100 800 500])
g=gramm('x',T.subject,'y',T.m_reaction_time,'color',T.session);
g.geom_swarm('alpha',0.5,'point_size',1.5);
g.stat_summary('dodge',0.7,'geom','black_point','type','quartile');
g.set_names('x','Subject','y','Reaction time (ms)','color','Session');
g.draw(); % The results structure is only populated once draw() has run
% Most of the data and graphic handles created by layers can be accessed
% through the results structure
set([g.results.stat_summary.point_handle],'Marker','s','MarkerSize',10);
%Export
g.export('file_name','groups_export','file_type','png');