FeaturizeText: System.ArgumentOutOfRangeException: Schema mismatch for input column 'Text_CharExtractor': expected Expected known-size vector of Single, got Vector<Single> #7529

@dangaier

Description

When using FeaturizeText, a couple of scenarios can result in the following exception:
System.ArgumentOutOfRangeException: Schema mismatch for input column 'Text_CharExtractor': expected Expected known-size vector of Single, got Vector<Single>

Scenario #1: The text for all rows is string.Empty.
Scenario #2: The resulting text for all rows is string.Empty because the rows contained only stop words, which are removed.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext();

// Explicitly enable English stop-word removal.
var options = new TextFeaturizingEstimator.Options()
{
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
    {
        Language = TextFeaturizingEstimator.Language.English
    },
};

// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
    // Scenario #1 - All empty strings
    new TextData(){ Text = string.Empty },
    new TextData(){ Text = string.Empty },
    new TextData(){ Text = string.Empty },

    // Scenario #2 - All empty strings because that's the result after removing all stop words.
    //new TextData(){ Text = "is" },
    //new TextData(){ Text = "the" },
    //new TextData(){ Text = "a" },
};

var dataview = mlContext.Data.LoadFromEnumerable(samples);
var textPipeline = mlContext.Transforms.Text.FeaturizeText("Text", options);

// Fit throws the ArgumentOutOfRangeException in both scenarios.
var model = textPipeline.Fit(dataview);

public class TextData
{
    public string Text { get; set; }
}
```

Scenario #1 is fairly easy to work around by inspecting the data before calling FeaturizeText.
Scenario #2 is much harder to work around because it happens inside ML.NET's pipeline. If you're programmatically receiving custom datasets to train on that you don't control, a given text column may contain only stop words, and in that case you'll get the unhandled exception. Short of checking whether the column contains only stop words before calling FeaturizeText, there's no good workaround.
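For illustration, the pre-check mentioned above could look roughly like the sketch below. It is only an approximation: `TextColumnGuard`, `HasUsableText`, and the abbreviated stop-word list are hypothetical names and data introduced here, and the whitespace tokenization is an assumption that will not exactly match what ML.NET does internally. A real check would need a stop-word list equivalent to the one the configured remover uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextColumnGuard
{
    // Hypothetical, abbreviated stop-word list for illustration only.
    // This is NOT the list ML.NET uses internally.
    private static readonly HashSet<string> SampleStopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "the", "is", "of", "and"
        };

    // Assumption: tokens are separated by whitespace.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Returns true if at least one row still has a token left
    // after dropping every token found in the stop-word set.
    public static bool HasUsableText(IEnumerable<string> rows, ISet<string> stopWords = null)
    {
        var words = stopWords ?? SampleStopWords;

        return rows.Any(row =>
            !string.IsNullOrWhiteSpace(row) &&
            row.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
               .Any(token => !words.Contains(token)));
    }
}
```

A caller could run `TextColumnGuard.HasUsableText(samples.Select(s => s.Text))` before `Fit` and skip the column (or the stop-word remover) when it returns false. This duplicates work the pipeline already does, which is why it is a stopgap and not a fix.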

Note that it also happens with "WordExtractor" (e.g. Text_WordExtractor) if you turn off char-gram creation.
Also, scenario #2 doesn't happen when calling FeaturizeText without the "options" parameter, i.e. when you're using the defaults. It appears that the default may not be removing stop words, despite what the documentation says.
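To compare the three configurations side by side, a small probe along these lines may help. It is a sketch, not a verified repro: `FitProbe`, `TryFit`, and `ProbeAll` are hypothetical helpers introduced here, it assumes the Microsoft.ML NuGet package is referenced, and it assumes that setting `CharFeatureExtractor` to null is how char-grams are turned off. No expected output is shown because the result may depend on the ML.NET version.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public class TextData
{
    public string Text { get; set; }
}

public static class FitProbe
{
    // Fits the estimator; returns the exception message, or null if Fit succeeded.
    public static string TryFit(IEstimator<ITransformer> estimator, IDataView data)
    {
        try { estimator.Fit(data); return null; }
        catch (Exception ex) { return ex.Message; }
    }

    // Runs scenario #2 (stop-word-only rows) against three configurations.
    public static IDictionary<string, string> ProbeAll()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new List<TextData>
        {
            new TextData { Text = "is" },
            new TextData { Text = "the" },
            new TextData { Text = "a" },
        });

        var withStopWords = new TextFeaturizingEstimator.Options
        {
            StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
            {
                Language = TextFeaturizingEstimator.Language.English
            },
        };

        var wordsOnly = new TextFeaturizingEstimator.Options
        {
            StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options
            {
                Language = TextFeaturizingEstimator.Language.English
            },
            // Assumption: null disables char-gram extraction.
            CharFeatureExtractor = null,
        };

        var text = mlContext.Transforms.Text;
        return new Dictionary<string, string>
        {
            ["defaults"] = TryFit(text.FeaturizeText("Text"), data),
            ["stop words removed"] = TryFit(text.FeaturizeText("Text", withStopWords), data),
            ["stop words removed, no char-grams"] = TryFit(text.FeaturizeText("Text", wordsOnly), data),
        };
    }
}
```

Printing each entry of `FitProbe.ProbeAll()` shows which configurations throw and which intermediate column (Text_CharExtractor or Text_WordExtractor) the message names.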

Related to:
#6621
#5714
