KeyError when removing long examples after removing duplicate rows #121

Closed
@serinamarie

Description

Error:

openai tools fine_tunes.prepare_data -f training_data_2022-09-14.jsonl
Analyzing...

- Your file contains 2446 prompt-completion pairs
Based on the analysis we will perform the following actions:
- [Recommended] Remove 1155 duplicate rows [Y/n]: y
- [Recommended] Remove 49 long examples [Y/n]: y
Traceback (most recent call last):
  File "/Users/ser/project/project-venv/bin/openai", line 8, in <module>
    sys.exit(main())
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/openai/_openai_scripts.py", line 63, in main
    args.func(args)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/openai/cli.py", line 531, in prepare_data
    apply_validators(
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/openai/validators.py", line 851, in apply_validators
    df, optional_applied = apply_optional_remediation(
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/openai/validators.py", line 578, in apply_optional_remediation
    df = remediation.optional_fn(df)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/openai/validators.py", line 171, in optional_fn
    return x.drop(long_indexes)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4957, in drop
    return super().drop(
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/pandas/core/generic.py", line 4267, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/pandas/core/generic.py", line 4311, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/Users/ser/project/project-venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6661, in drop
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: '[330, 352, 377, 378, 422, 424, 435, 1172, 1194, 1219, 1220, 1264, 1266, 1277, 1468, 1498, 1549, 1641, 1648, 1714, 1741, 1816, 1859, 1984] not found in axis'

I believe this happens because the indexes of the long examples are determined before the duplicate rows are removed. Once the duplicates are dropped, many of those long examples are no longer in the DataFrame, so the drop call raises this KeyError. As a workaround, I have to apply only the first recommendation, then run the tool again on the resulting file to apply the second one.

It would be great to be able to apply both changes to the same file.
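
For illustration, here is a minimal sketch of what I think is going on, using a toy DataFrame rather than my real data. This is not the actual code in validators.py; the helper names drop_long_strict and drop_long_tolerant are hypothetical, and the tolerant version is only one possible way to handle it (passing errors="ignore" to drop would be another).

```python
import pandas as pd

# Toy stand-in for the training file: rows 1 and 3 duplicate rows 0 and 2.
df = pd.DataFrame(
    {
        "prompt": ["a", "a", "b" * 50, "b" * 50, "c"],
        "completion": ["x", "x", "y", "y", "z"],
    }
)

# Labels of the "long" rows, computed on the ORIGINAL frame
# (the length threshold of 10 is arbitrary, for the example only).
long_indexes = df[df["prompt"].str.len() > 10].index

# First remediation: remove duplicate rows (keeps the first occurrence).
deduped = df.drop_duplicates()


def drop_long_strict(frame, labels):
    # Hypothetical helper: fails if any label was already removed.
    return frame.drop(labels)


def drop_long_tolerant(frame, labels):
    # Hypothetical helper: only drop labels that still exist in the frame.
    return frame.drop(frame.index.intersection(labels))


try:
    drop_long_strict(deduped, long_indexes)
    raised = False
except KeyError:
    # Same kind of failure as in the traceback above.
    raised = True

result = drop_long_tolerant(deduped, long_indexes)
```

In this sketch the strict version fails because one of the long rows was already removed as a duplicate, while the tolerant version drops the remaining long row and leaves the rest untouched.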

Labels: bug (Something isn't working)
