Response to Vipul Kochar's experiment with LLMs

October 10, 2023

My background

To be precise and helpful for this experiment, I am explaining my background so that readers can weigh what to retain from this response as experience versus what to treat as opinion. I have never worked on banking applications like Finacle, so I will be of little help in validating the test cases, but I have studied finance and understand basic finance concepts. I have considerable experience in testing Generative AI models from well before the ChatGPT wave started: at my last job, a startup, we were building a summarisation model for summarising conversations. Although I work as a Tester, I come with a strong tech background, and my passion for building things has led me to build some ML solutions as pet projects.

Opinion on the Overall Approach taken by Vipul

After reading Vipul's post (across its multiple revisions) multiple times, along with some of the very good responses to it, I like the idea of testing LLMs. What he intends to do is very much what we in the testing community should all do, rather than relying on opinions posted in various forums. I am impressed with his curiosity.

I don't know Vipul personally, but his LinkedIn profile says he hails from a testing background, which makes me wonder how he could miss the basic flaw in this experiment that I am going to talk about. As they say, when we don't know the domain, we don't jump straight to solutions, and testing and test cases are kinds of solutions. As a tester, when I approach a testing assignment, I try to learn as much as possible about the software we are building and the specific set of problems it is meant to solve. I wish Vipul had done that as a first step before the experiment, especially since he is drawing certain opinions from it. Otherwise an experiment should stay neutral, simply stating the facts of procedures, inputs and outputs without any inferences, because inference requires knowledge of the specific domain to weigh the correctness of things.

The second flaw I see with the experiment is trying to come up with test cases upfront at all. In my experience, whatever application I have tested, however deeply I know the screens and functionality, and however elaborate my test cases are, there is always scope for unknowns, which get uncovered in experiential testing while interacting with the application. For example: with how much precision can the application persist an interest rate? This detail alone accounts for a difference in amounts. On an amount of 1,000,000, an interest charge of 1.26% for 3 months amounts to 37,800, while an interest charge of 1.255% for 3 months amounts to 37,650. If the application allows precision of only 2 decimal places, this difference will be prominent. I might explore this only after probing the application to see how much precision it actually allows right now, though some testers may come up with this test case upfront, depending on the knowledge they have of this domain.
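To make the precision point concrete, here is a minimal sketch in Python. The amounts, the simple-interest formula, and the 2-decimal persistence are assumptions taken from my example above, not from Finacle itself:

    from decimal import Decimal, ROUND_HALF_UP

    def simple_interest(principal: Decimal, monthly_rate_pct: Decimal, months: int) -> Decimal:
        # Simple interest: principal * (rate% / 100) * number of months
        return principal * monthly_rate_pct / Decimal(100) * months

    principal = Decimal("1000000")
    intended_rate = Decimal("1.255")  # the rate the bank intends to apply

    # Hypothetical: the application persists rates with only 2 decimal places,
    # so 1.255 is stored as 1.26 after half-up rounding.
    stored_rate = intended_rate.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

    print(simple_interest(principal, intended_rate, 3))  # 37650: the interest as intended
    print(simple_interest(principal, stored_rate, 3))    # 37800: what the application would compute

Decimal is used instead of float so that the only rounding in the sketch is the deliberate one applied to the stored rate.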

The nuances involved

Hallucination/Bias

After reading the many test cases this experiment produced, I can see that Vipul gave some information to the model, and the model came up with answers that include things it already knows, without Vipul even providing them. This looks fascinating on the surface, but it has its own troubles.

  1. The model is generating (hence "Generative") content based on the data it was fed and the weights and biases it formed during its training process. In the real testing world, when we call someone a Subject Matter Expert, we can trust their knowledge to a certain degree based on accreditations and experience. Can we do the same for LLMs? Can we call LLMs Subject Matter Experts?
  2. The other issue is that software undergoes revisions. I am not sure which revision's data was fed into the model for it to learn about this specific domain, nor which revision Vipul is asking it to write test cases for; the testing may differ based on this factor as well.

Yes-man tendency of LLMs

Even if I had all the basic background in this specific domain, and Vipul had asked me, as a Tester, the same question he asked the LLM, I might have expressed my inability to come up with an answer, because I would have needed more specific details: What is meant by Finacle? Why are we building it? And so on. I would have asked for more context through appropriate questions, and only then come up with answers. LLMs in their present state lack this capability of sitting in the driver's seat to dig out more context; they will answer whatever they can with the training they had and the limited context you have provided. This makes them very unreliable. They give up only in extreme situations; otherwise they will answer even if the answer may be completely wrong.

The rest of the issues

The remaining inabilities of LLMs have been well tested and brought to light already by Michael Bolton and James Bach's work, so I am not repeating them here. Whatever issues they have identified are, in my opinion, valid.

My overall take

Most of my responses above are critical, yet I see LLMs as good progress and a genuine advancement in technology. I have been using them for certain tasks, like rephrasing my blog posts to articulate my thoughts in better words (I make sure it is very clear to them what they should retain and what they should rephrase). Still, I keep weighing their responses in whatever tasks I delegate to them, and the final judgment on what to accept and what to reject stays with me.