IBM InstructLab

Increasing accuracy, efficiency, and usability in large-scale data review for LLM (large language model) training.

THIS WORK IS NDA PROTECTED

If you would like to learn more, please reach out to me!

Timeline

8 weeks

Role

Product Designer

Team

1 project lead, 5 designers

Deliverables

High-fidelity Figma prototype

Background

InstructLab is an open-source IBM project for training and fine-tuning enterprise-grade LLMs with synthetically generated data on IBM's flagship watsonx.ai product.

This allows enterprises to create and train custom models for tasks such as data analysis and customer service.

Challenge

Currently, InstructLab does not have a consistent way to review the hundreds of sets of training data being fed into the model.

Instead, individual teams have to rely on irregular and manual review processes to refine their models.

Results

We designed a smoother, more intuitive workflow for reviewing synthetic data, developing features like modular toggles, collaboration tools, and source traceability to make the review process more efficient and transparent.

Key Problem: Reviewing synthetic data in InstructLab is unstructured, laborious, and full of bottlenecks.

01. Lack of a centralized review process… 

made reviewing and collaborating especially difficult.

02. Users needed a way to route reviews to the right expert…

or else some questions were left partially or completely unreviewed.

03. There was no clear collaborative system… 

so users had to self-assign and review thousands of questions without structured coordination.

How might we allow teams of reviewers to efficiently and collaboratively approve or deny sets of synthetic data?

The first version of our redesigned flow.

Redesigning the Synthetic Data Generation (SDG) Review Process

We started by reimagining the data viewing experience. Assuming that most reviewing teams were highly collaborative, we generated iterations of the commenting feature, emphasizing the ability to reference the discussion during review and to tag other commenters if a user was unsure about the content.

Using the current watsonx.ai interfaces as a reference, and with continuous feedback from two watsonx.ai UX designers and one InstructLab developer, we designed screens encompassing the three prioritized features: Collaborative Team Tools (filtering, commenting), List and Modular Views, and Approving, Denying, and Editing.

A glimpse into our low-fidelity mock-ups. Here, we're exploring the possibilities of the modular view.

How Feedback Reshaped Our Priorities

With our mid-fidelity prototype, our team gathered critical user feedback that reshaped our approach for the next iteration.

  • We learned that data review is a much more individualized process than we had anticipated. Introducing collaboration tools without proper consideration could disrupt, rather than bolster, a team's natural workflows.

  • Reviewers frequently leaned on the reference document to evaluate the quality of the synthetically generated data, a need our initial design did not prioritize.

  • Reviewers also strongly preferred the modular view over the list view for its higher functionality, but found toggling between the two screens unintuitive in our mid-fidelity interface.

These insights helped us prioritize three key areas for our next iteration: making navigation more intuitive, improving commenting, and increasing the accessibility of the reference document.

Key Improvements: Translating Feedback into Frictionless Design

While the high-fidelity designs cannot be shared, here are the specific adjustments we prioritized to better align with our users:

Navigation:

  • Introduced a toggle icon to clearly indicate the transition between views.

  • Added a list view icon to indicate the current page.

  • Designed a colorful animated transition to clearly signal the shift to the modular view.

Commenting:

  • Implemented a minimally invasive comment display to avoid disrupting the reviewing process.

  • Redesigned the comment modal to support short interactions between reviewers.

Reference Documents:

  • Embedded a reference document for each question in modular view.

  • Added a PDF search tool for convenient information look-up.

  • Enabled reference documents in list view for consistent access.

Reflections & Learnings

Stepping Into the User's Shoes

Working with IBM on this project was such a rewarding experience! It really underscored the importance of thoroughly understanding a user's existing workflow and stepping into their world, especially when that workflow is highly technical or unfamiliar.

With true design empathy, we were able to pivot and iterate our work towards what our users were already comfortable with, resulting in a more seamless integration and making adoption feel natural.

The hardest part was deprioritizing the collaboration tools we had spent weeks fleshing out. Learning to "kill your darlings" to build something truly useful and keep stakeholders aligned was a crucial lesson in design trade-offs!

Uncovering the Inner Workings of Enterprise AI

This was my first hands-on experience with large-scale AI implementation, and I really hit the ground running with InstructLab. Having the opportunity to touch base with so many experts across the watsonx.ai and SDG pipelines was invaluable to our research and to fleshing out our designs. I feel like I finally got to make sense of such an abstract topic, and seeing its tangible applications makes me eager to continue exploring the field of enterprise B2B AI.

Continued Design Exploration

Based on stakeholder feedback, there are a few secondary features we would love to expand on in the future to push this design into a more advanced iteration:

  • A scalable platform with AI augmentation to assist human experts in validating quality across massive datasets.

  • A dashboard for managers of reviewing teams to track review statistics.

  • Metrics on the number of approved or denied datasets and confidence scores to evaluate overall synthetic data quality.

  • Role-based task assignment so teams can redirect questions to other members.

"This is a big step forward from what we've been doing in the past — a large Improvement!"

Jacob Engelbrecht

Backend Software Engineer @ IBM
