Description
When using FeaturizeText, there are a couple of scenarios that can result in the following exception:
`System.ArgumentOutOfRangeException: Schema mismatch for input column 'Text_CharExtractor': expected Expected known-size vector of Single, got Vector<Single>`
Scenario #1: The text in every row is string.Empty.
Scenario #2: The resulting text in every row is string.Empty because the rows contained only stop words, which are removed.
```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext();

// Explicitly enable English stop-word removal.
var options = new TextFeaturizingEstimator.Options()
{
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
    {
        Language = TextFeaturizingEstimator.Language.English
    },
};

// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
    // Scenario #1 - All empty strings
    new TextData(){ Text = string.Empty },
    new TextData(){ Text = string.Empty },
    new TextData(){ Text = string.Empty },
    // Scenario #2 - All empty strings because that's the result after removing all stop words.
    //new TextData(){ Text = "is" },
    //new TextData(){ Text = "the" },
    //new TextData(){ Text = "a" },
};

var dataview = mlContext.Data.LoadFromEnumerable(samples);
var textPipeline = mlContext.Transforms.Text.FeaturizeText("Text", options);
var model = textPipeline.Fit(dataview);

public class TextData
{
    public string Text { get; set; }
}
```
It's fairly easy to work around scenario #1 by inspecting the data before calling FeaturizeText.
The second scenario is much harder to work around because it happens inside ML.NET's pipeline. If you're programmatically receiving custom datasets to train on that you don't control, a given text column may contain only stop words, and in that case you'll get the unhandled exception. Short of checking whether the column contains only stop words before calling FeaturizeText, there isn't a good workaround.
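For reference, the kind of pre-check described above could look roughly like the sketch below. The `StopWordGuard` class, the `HasUsableText` method, and the tiny stop-word list are all hypothetical and for illustration only. The list is not the one ML.NET uses, and the whitespace tokenization will not match ML.NET's exactly, so this approximates the check rather than giving a reliable fix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordGuard
{
    // Hypothetical, deliberately tiny stop-word list for illustration only.
    // It is NOT the list that ML.NET's stop-word remover uses.
    private static readonly HashSet<string> SampleStopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "is", "of", "the"
        };

    // Returns true if at least one row contains a token that is not in the
    // sample stop-word list. Null, empty, and whitespace-only rows are skipped.
    public static bool HasUsableText(IEnumerable<string> rows)
    {
        var separators = new[] { ' ', '\t', '\r', '\n' };
        return rows
            .Where(text => !string.IsNullOrWhiteSpace(text))
            .SelectMany(text => text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            .Any(token => !SampleStopWords.Contains(token));
    }
}
```

With the repro above, calling `StopWordGuard.HasUsableText(samples.Select(s => s.Text))` before `Fit` would let the caller skip or drop the text column when it returns false.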
Note that it also happens with "WordExtractor" (e.g. Text_WordExtractor) if you turn off char-gram creation.
Also, scenario #2 doesn't happen when calling FeaturizeText without the "options" parameter, i.e. when using the defaults. It appears that the defaults may not remove stop words, despite what the documentation says.