Performance
Macro stats

Micro stats

Rank distrubution across models
For each model, we ranked each class by f1, and then plotted the distrubution of ranks across models. This gives a sense of which classes are consistently learnable, and classes for which models varied in how well they learned them

Generalization from training to test data

Similarity of models
To explore the relative contribution of each approach, we examined the correlation between f1 scores on each class for each pair of methods (supplementary methods). In general, there was high correlation between different C3POs, but less correlation between C3POs and Chebifier. This indicates that deep learning and program learning approaches are likely complementary:
