Model: gemini-3-pro-preview

📋 Dashboard Overview

📋 Configuration Summaries

⚙️ text_active_think

📊 Samples: 10

🔍 Exploration

avg_node_coverage: 1
avg_edge_coverage: 0.512
avg_exploration_steps: 14.300
avg_action_cost: 13.300
avg_action_fail_ratio: 0.027
avg_valid_action_ratio: 1
avg_final_information_gain: 0.821
avg_false_belief_steps: 8.300
avg_false_belief_f1: 0.975
avg_false_belief_f1_position: 0.980
avg_false_belief_f1_facing: 0.967
avg_false_belief_action_cost: 7.100
avg_false_belief_action_cost_after_seen: 0.111
avg_action_counts:
  move: 5.300
  rotate: 9.600
  return: 0
  observe: 13.300
  term: 1
  forced_term: 0
  query: 0

✅ Evaluation

avg_accuracy: 0.801
task_metrics:
DirectionEvaluationTask:
accuracy: 0.750
total_count: 30
task_score: 22.500
PovEvaluationTask:
accuracy: 0.750
total_count: 30
task_score: 22.500
BackwardPovTextEvaluationTask:
accuracy: 0.933
total_count: 30
task_score: 28
View2ActionTextEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
AlloMappingEvaluationTask:
accuracy: 0.847
total_count: 30
task_score: 25.402
RotEvaluationTask:
accuracy: 0.933
total_count: 30
task_score: 28
Location2ViewEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
View2LocationTextEvaluationTask:
accuracy: 0.792
total_count: 30
task_score: 23.765
Action2ViewEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
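
The figures in this block are consistent with two simple relationships: `task_score` equals `accuracy × total_count`, and `avg_accuracy` is the unweighted mean of the per-task accuracies. Both are inferred from the reported numbers, not from a documented definition. The sketch below checks them against the values above; the helper names `task_score` and `macro_average` are hypothetical.

```python
# Accuracies copied from the text_active_think "Evaluation" block above.
task_accuracy = {
    "DirectionEvaluationTask": 0.750,
    "PovEvaluationTask": 0.750,
    "BackwardPovTextEvaluationTask": 0.933,
    "View2ActionTextEvaluationTask": 0.767,
    "AlloMappingEvaluationTask": 0.847,
    "RotEvaluationTask": 0.933,
    "Location2ViewEvaluationTask": 0.667,
    "View2LocationTextEvaluationTask": 0.792,
    "Action2ViewEvaluationTask": 0.767,
}
TOTAL_COUNT = 30  # total_count reported for every task in this block


def task_score(accuracy: float, total_count: int = TOTAL_COUNT) -> float:
    """Assumed relationship: score = accuracy * number of questions."""
    return accuracy * total_count


def macro_average(accuracies: dict) -> float:
    """Unweighted mean over tasks (assumed definition of avg_accuracy)."""
    return sum(accuracies.values()) / len(accuracies)


print(round(macro_average(task_accuracy), 3))  # compare with avg_accuracy: 0.801
print(task_score(task_accuracy["DirectionEvaluationTask"]))  # compare with task_score: 22.500
```

Fractional scores such as 25.402 for AlloMappingEvaluationTask differ slightly from `0.847 × 30` because the displayed accuracy is rounded to three decimals.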

✅ Evaluation (prompt_cogmap)

avg_accuracy: 0.752
task_metrics:
DirectionEvaluationTask:
accuracy: 0.633
total_count: 30
task_score: 19
PovEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
BackwardPovTextEvaluationTask:
accuracy: 0.933
total_count: 30
task_score: 28
Action2ViewEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
View2ActionTextEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
AlloMappingEvaluationTask:
accuracy: 0.800
total_count: 30
task_score: 24.006
RotEvaluationTask:
accuracy: 0.867
total_count: 30
task_score: 26
Location2ViewEvaluationTask:
accuracy: 0.633
total_count: 30
task_score: 19
View2LocationTextEvaluationTask:
accuracy: 0.736
total_count: 30
task_score: 22.071

✅ Evaluation (use_gt_cogmap)

avg_accuracy: 0.957
task_metrics:
RotEvaluationTask:
accuracy: 1
total_count: 30
task_score: 30
Location2ViewEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
View2LocationTextEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29.016
DirectionEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
PovEvaluationTask:
accuracy: 0.950
total_count: 30
task_score: 28.500
BackwardPovTextEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
Action2ViewEvaluationTask:
accuracy: 1
total_count: 30
task_score: 30
View2ActionTextEvaluationTask:
accuracy: 0.833
total_count: 30
task_score: 25
AlloMappingEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29

✅ Evaluation (use_model_cogmap)

avg_accuracy: 0.755
task_metrics:
DirectionEvaluationTask:
accuracy: 0.683
total_count: 30
task_score: 20.500
PovEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
BackwardPovTextEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
Action2ViewEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
View2ActionTextEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
AlloMappingEvaluationTask:
accuracy: 0.788
total_count: 30
task_score: 23.630
RotEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
Location2ViewEvaluationTask:
accuracy: 0.467
total_count: 30
task_score: 14
View2LocationTextEvaluationTask:
accuracy: 0.723
total_count: 30
task_score: 21.681

🧠 Cognitive Map

exploration:
consistency:
facing_stability_avg: 0.926
local_vs_global_avg:
dir: 0.806
pos: 0.860
facing: 0.947
overall: 0.871
facing_update_avg: 0.926
position_update_avg: 0.661
position_stability_avg: 0.661
correctness:
last_global_vs_gt_full:
dir: 0.777
pos: 0.742
facing: 1
overall: 0.840
error:
local_vs_gt_local_avg:
dir: 0.892
pos: 0.890
facing: 1
overall: 0.927
newly_observed_vs_gt_local_avg:
dir: 0.897
pos: 0.883
facing: 1
overall: 0.926
agent_vs_gt_agent_avg:
dir: 0.833
pos: 0.870
facing: 1
overall: 0.901
global_vs_gt_global_avg:
dir: 0.798
pos: 0.799
facing: 0.937
overall: 0.845
n_samples: Global: 10, Local: 10, Newly: 10
evaluation:
correctness:
(none)

🌫️ Fog Probe

precision_avg: 0.777
f1_avg: 0.749
recall_avg: 0.755
n_samples: 10
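
Note that `f1_avg` (0.749) is lower than the F1 computed from `precision_avg` and `recall_avg`. One plausible explanation, stated here as an assumption, is that F1 is computed per sample and then averaged, which generally differs from taking the harmonic mean of the averaged precision and recall. A minimal sketch of the comparison (the function name `f1` is hypothetical):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averages copied from the Fog Probe block above.
precision_avg, recall_avg, reported_f1_avg = 0.777, 0.755, 0.749

f1_of_averages = f1(precision_avg, recall_avg)
print(round(f1_of_averages, 3), reported_f1_avg)
```

The gap between the two numbers is what would be expected if some individual samples have unbalanced precision and recall.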

🧭 False Belief CogMap

inertia: 0.037
changed:
dir: None
pos: 0.748
facing: 0.850
overall: None
retention:
dir: None
pos: 0.351
facing: 0.150
overall: None
unchanged:
dir: 0.695
pos: 0.714
facing: 0.824
overall: 0.744
unchanged_retention:
dir: None
pos: 0.758
facing: 0.943
overall: None
unchanged_retention_minus_retention:
pos: 0.764
facing: 0.819
unchanged_exploration:
dir: 0.796
pos: 0.762
facing: 1
overall: 0.853

📈 Correlation

cogmap_acc_correlations:
avg_accuracy:
pearson_r: 0.426
p_value: 0.220
significant: False
n_samples: 10
Location2ViewEvaluationTask:
pearson_r: 0.049
p_value: 0.894
significant: False
n_samples: 10
View2LocationTextEvaluationTask:
pearson_r: -0.146
p_value: 0.688
significant: False
n_samples: 10
Action2ViewEvaluationTask:
pearson_r: 0.802
p_value: 0.005
significant: True
n_samples: 10
RotDualEvaluationTask:
pearson_r: -0.174
p_value: 0.631
significant: False
n_samples: 10
AlloMappingEvaluationTask:
pearson_r: 0.232
p_value: 0.519
significant: False
n_samples: 10
PovEvaluationTask:
pearson_r: 0.534
p_value: 0.112
significant: False
n_samples: 10
DirectionEvaluationTask:
pearson_r: -0.265
p_value: 0.460
significant: False
n_samples: 10
RotEvaluationTask:
pearson_r: -0.174
p_value: 0.631
significant: False
n_samples: 10
BackwardPovTextEvaluationTask:
pearson_r: -0.380
p_value: 0.279
significant: False
n_samples: 10
View2ActionTextEvaluationTask:
pearson_r: 0.120
p_value: 0.741
significant: False
n_samples: 10
cogmap_infogain_correlation:
pearson_r: 0.591
p_value: 0.072
significant: False
n_samples: 10
n_samples: 10
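
The `significant` flags above are consistent with a two-sided test at α = 0.05 on n = 10 samples; the threshold is an assumption inferred from the reported p-values, not a documented setting. Under that assumption, a correlation is flagged when its t statistic, t = r·√(n−2)/√(1−r²), exceeds the critical value 2.306 for 8 degrees of freedom, which corresponds to |r| above roughly 0.63. A minimal sketch with hypothetical helper names:

```python
import math

# Two-sided critical t value at alpha = 0.05 for n - 2 = 8 degrees of freedom.
# Only valid for n = 10; other sample sizes need a different critical value.
T_CRIT_DF8 = 2.306


def pearson_t(r: float, n: int) -> float:
    """t statistic for testing that a Pearson correlation is zero (|r| < 1)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def is_significant(r: float, n: int = 10, t_crit: float = T_CRIT_DF8) -> bool:
    """True when |t| exceeds the critical value."""
    return abs(pearson_t(r, n)) > t_crit


# Correlations copied from the block above.
for name, r in [
    ("Action2ViewEvaluationTask", 0.802),
    ("avg_accuracy", 0.426),
    ("cogmap_infogain_correlation", 0.591),
]:
    print(name, round(pearson_t(r, 10), 2), is_significant(r))
```

With only 10 samples, even moderately large correlations such as 0.591 fall short of the threshold, so the non-significant rows should be read as inconclusive.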
Performance Charts (figures not shown): Information Gain per Turn; Cognitive Map Update, Full, and Self-Tracking Turn Averages; Fog Probe F1, Precision, and Recall per Turn; False Belief CogMap Unchanged per Turn; Position Update and Facing Update per Turn; Position Stability and Facing Stability per Turn; Cognitive Map vs Accuracy Correlation; Cognitive Map vs Information Gain Correlation.

⚙️ text_passive_think_strategist

📊 Samples: 10

🔍 Exploration

avg_node_coverage: None
avg_edge_coverage: None
avg_exploration_steps: None
avg_action_cost: None
avg_action_fail_ratio: None
avg_valid_action_ratio: None
avg_final_information_gain: None
avg_false_belief_steps: None
avg_false_belief_f1: None
avg_false_belief_f1_position: None
avg_false_belief_f1_facing: None
avg_false_belief_action_cost: None
avg_false_belief_action_cost_after_seen: None

✅ Evaluation

avg_accuracy: 0.849
task_metrics:
DirectionEvaluationTask:
accuracy: 0.817
total_count: 30
task_score: 24.500
PovEvaluationTask:
accuracy: 0.950
total_count: 30
task_score: 28.500
BackwardPovTextEvaluationTask:
accuracy: 0.933
total_count: 30
task_score: 28
View2ActionTextEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
AlloMappingEvaluationTask:
accuracy: 0.873
total_count: 30
task_score: 26.187
RotEvaluationTask:
accuracy: 0.867
total_count: 30
task_score: 26
Location2ViewEvaluationTask:
accuracy: 0.817
total_count: 30
task_score: 24.500
View2LocationTextEvaluationTask:
accuracy: 0.865
total_count: 30
task_score: 25.940
Action2ViewEvaluationTask:
accuracy: 0.850
total_count: 30
task_score: 25.500

🧠 Cognitive Map

exploration:
correctness:
global_full:
(none)
n_samples: 0

⚙️ vision_active_think

📊 Samples: 10

🔍 Exploration

avg_node_coverage: 1
avg_edge_coverage: 0.534
avg_exploration_steps: 15
avg_action_cost: 14
avg_action_fail_ratio: 0.046
avg_valid_action_ratio: 1
avg_final_information_gain: 0.851
avg_false_belief_steps: 11.500
avg_false_belief_f1: 0.600
avg_false_belief_f1_position: 0.813
avg_false_belief_f1_facing: 0.217
avg_false_belief_action_cost: 10.200
avg_false_belief_action_cost_after_seen: 3.200
avg_action_counts:
  move: 5.300
  rotate: 9.800
  return: 0
  observe: 14
  term: 1
  forced_term: 0
  query: 0

✅ Evaluation

avg_accuracy: 0.608
task_metrics:
DirectionEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17
PovEvaluationTask:
accuracy: 0.400
total_count: 30
task_score: 12
BackwardPovTextEvaluationTask:
accuracy: 0.767
total_count: 30
task_score: 23
BackwardPovVisionEvaluationTask:
accuracy: 0.633
total_count: 30
task_score: 19
View2ActionTextEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17
View2ActionVisionEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
AlloMappingEvaluationTask:
accuracy: 0.650
total_count: 30
task_score: 19.489
RotEvaluationTask:
accuracy: 0.833
total_count: 30
task_score: 25
Location2ViewEvaluationTask:
accuracy: 0.467
total_count: 30
task_score: 14
View2LocationTextEvaluationTask:
accuracy: 0.654
total_count: 30
task_score: 19.629
View2LocationVisionEvaluationTask:
accuracy: 0.553
total_count: 30
task_score: 16.584
Action2ViewEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17

✅ Evaluation (prompt_cogmap)

avg_accuracy: 0.556
task_metrics:
DirectionEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
PovEvaluationTask:
accuracy: 0.300
total_count: 30
task_score: 9
BackwardPovTextEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
BackwardPovVisionEvaluationTask:
accuracy: 0.533
total_count: 30
task_score: 16
Action2ViewEvaluationTask:
accuracy: 0.483
total_count: 30
task_score: 14.500
View2ActionTextEvaluationTask:
accuracy: 0.467
total_count: 30
task_score: 14
View2ActionVisionEvaluationTask:
accuracy: 0.733
total_count: 30
task_score: 22
AlloMappingEvaluationTask:
accuracy: 0.568
total_count: 30
task_score: 17.054
RotEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
Location2ViewEvaluationTask:
accuracy: 0.450
total_count: 30
task_score: 13.500
View2LocationTextEvaluationTask:
accuracy: 0.666
total_count: 30
task_score: 19.986
View2LocationVisionEvaluationTask:
accuracy: 0.545
total_count: 30
task_score: 16.359

✅ Evaluation (use_gt_cogmap)

avg_accuracy: 0.963
task_metrics:
DirectionEvaluationTask:
accuracy: 0.917
total_count: 30
task_score: 27.500
PovEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
BackwardPovTextEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
BackwardPovVisionEvaluationTask:
accuracy: 0.967
total_count: 30
task_score: 29
Action2ViewEvaluationTask:
accuracy: 0.933
total_count: 30
task_score: 28
View2ActionTextEvaluationTask:
accuracy: 0.900
total_count: 30
task_score: 27
View2ActionVisionEvaluationTask:
accuracy: 0.800
total_count: 30
task_score: 24
AlloMappingEvaluationTask:
accuracy: 1
total_count: 30
task_score: 30
RotEvaluationTask:
accuracy: 1
total_count: 30
task_score: 30
Location2ViewEvaluationTask:
accuracy: 1
total_count: 30
task_score: 30
View2LocationTextEvaluationTask:
accuracy: 0.988
total_count: 30
task_score: 29.639
View2LocationVisionEvaluationTask:
accuracy: 0.761
total_count: 30
task_score: 22.831

✅ Evaluation (use_model_cogmap)

avg_accuracy: 0.554
task_metrics:
DirectionEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17
PovEvaluationTask:
accuracy: 0.300
total_count: 30
task_score: 9
BackwardPovTextEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
BackwardPovVisionEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
Action2ViewEvaluationTask:
accuracy: 0.533
total_count: 30
task_score: 16
View2ActionTextEvaluationTask:
accuracy: 0.500
total_count: 30
task_score: 15
View2ActionVisionEvaluationTask:
accuracy: 0.667
total_count: 30
task_score: 20
AlloMappingEvaluationTask:
accuracy: 0.629
total_count: 30
task_score: 18.861
RotEvaluationTask:
accuracy: 0.700
total_count: 30
task_score: 21
Location2ViewEvaluationTask:
accuracy: 0.467
total_count: 30
task_score: 14
View2LocationTextEvaluationTask:
accuracy: 0.622
total_count: 30
task_score: 18.657
View2LocationVisionEvaluationTask:
accuracy: 0.557
total_count: 30
task_score: 16.697

🧠 Cognitive Map

exploration:
consistency:
facing_stability_avg: 0.597
local_vs_global_avg:
dir: 0.677
pos: 0.689
facing: 0.538
overall: 0.635
facing_update_avg: 0.597
position_update_avg: 0.630
position_stability_avg: 0.630
correctness:
last_global_vs_gt_full:
dir: 0.670
pos: 0.639
facing: 0.299
overall: 0.536
error:
local_vs_gt_local_avg:
dir: 0.733
pos: 0.681
facing: 0.437
overall: 0.617
newly_observed_vs_gt_local_avg:
dir: 0.696
pos: 0.671
facing: 0.328
overall: 0.565
agent_vs_gt_agent_avg:
dir: 0.675
pos: 0.728
facing: 0.991
overall: 0.798
global_vs_gt_global_avg:
dir: 0.663
pos: 0.657
facing: 0.401
overall: 0.574
n_samples: Global: 10, Local: 10, Newly: 10
evaluation:
correctness:
(none)

🌫️ Fog Probe

precision_avg: 0.732
f1_avg: 0.684
recall_avg: 0.707
n_samples: 10

🧭 False Belief CogMap

inertia: 0.135
changed:
dir: None
pos: 0.657
facing: 0.400
overall: None
retention:
dir: None
pos: 0.349
facing: 0.450
overall: None
unchanged:
dir: 0.588
pos: 0.607
facing: 0.326
overall: 0.507
unchanged_retention:
dir: None
pos: 0.727
facing: 0.669
overall: None
unchanged_retention_minus_retention:
pos: 0.745
facing: 0.017
unchanged_exploration:
dir: 0.656
pos: 0.650
facing: 0.186
overall: 0.497

📈 Correlation

cogmap_acc_correlations:
avg_accuracy:
pearson_r: 0.748
p_value: 0.013
significant: True
n_samples: 10
BackwardPovVisionEvaluationTask:
pearson_r: 0.771
p_value: 0.009
significant: True
n_samples: 10
View2LocationVisionEvaluationTask:
pearson_r: 0.339
p_value: 0.338
significant: False
n_samples: 10
Location2ViewEvaluationTask:
pearson_r: 0.236
p_value: 0.511
significant: False
n_samples: 10
View2LocationTextEvaluationTask:
pearson_r: 0.636
p_value: 0.048
significant: True
n_samples: 10
Action2ViewEvaluationTask:
pearson_r: 0.550
p_value: 0.100
significant: False
n_samples: 10
RotDualEvaluationTask:
pearson_r: 0.492
p_value: 0.149
significant: False
n_samples: 10
AlloMappingEvaluationTask:
pearson_r: 0.597
p_value: 0.069
significant: False
n_samples: 10
PovEvaluationTask:
pearson_r: 0.383
p_value: 0.274
significant: False
n_samples: 10
DirectionEvaluationTask:
pearson_r: -0.339
p_value: 0.338
significant: False
n_samples: 10
RotEvaluationTask:
pearson_r: -0.051
p_value: 0.890
significant: False
n_samples: 10
BackwardPovTextEvaluationTask:
pearson_r: 0.728
p_value: 0.017
significant: True
n_samples: 10
View2ActionTextEvaluationTask:
pearson_r: 0.446
p_value: 0.196
significant: False
n_samples: 10
View2ActionVisionEvaluationTask:
pearson_r: 0.239
p_value: 0.507
significant: False
n_samples: 10
cogmap_infogain_correlation:
pearson_r: 0.498
p_value: 0.143
significant: False
n_samples: 10
n_samples: 10
Performance Charts (figures not shown): Information Gain per Turn; Cognitive Map Update, Full, and Self-Tracking Turn Averages; Fog Probe F1, Precision, and Recall per Turn; False Belief CogMap Unchanged per Turn; Position Update and Facing Update per Turn; Position Stability and Facing Stability per Turn; Cognitive Map vs Accuracy Correlation; Cognitive Map vs Information Gain Correlation.

⚙️ vision_passive_think_scout

📊 Samples: 10

🔍 Exploration

avg_node_coverage: None
avg_edge_coverage: None
avg_exploration_steps: None
avg_action_cost: None
avg_action_fail_ratio: None
avg_valid_action_ratio: None
avg_final_information_gain: None
avg_false_belief_steps: None
avg_false_belief_f1: None
avg_false_belief_f1_position: None
avg_false_belief_f1_facing: None
avg_false_belief_action_cost: None
avg_false_belief_action_cost_after_seen: None

✅ Evaluation

avg_accuracy: 0.617
task_metrics:
Location2ViewEvaluationTask:
accuracy: 0.683
total_count: 30
task_score: 20.500
View2LocationTextEvaluationTask:
accuracy: 0.716
total_count: 30
task_score: 21.494
View2LocationVisionEvaluationTask:
accuracy: 0.572
total_count: 30
task_score: 17.170
DirectionEvaluationTask:
accuracy: 0.583
total_count: 30
task_score: 17.500
PovEvaluationTask:
accuracy: 0.417
total_count: 30
task_score: 12.500
BackwardPovTextEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17
BackwardPovVisionEvaluationTask:
accuracy: 0.533
total_count: 30
task_score: 16
View2ActionTextEvaluationTask:
accuracy: 0.500
total_count: 30
task_score: 15
View2ActionVisionEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17
AlloMappingEvaluationTask:
accuracy: 0.685
total_count: 30
task_score: 20.559
RotEvaluationTask:
accuracy: 0.833
total_count: 30
task_score: 25
Action2ViewEvaluationTask:
accuracy: 0.567
total_count: 30
task_score: 17

🧠 Cognitive Map

exploration:
correctness:
global_full:
(none)
n_samples: 0
