Finding a bug without knowledge of the system

Hi Ya’all,

I’m looking for advice I can give a co-worker on how to find or fix a bug in a system whose internals you don’t know. This has always been a thing I’ve done, but thinking about it, it is a weird problem. How do you approach a problem/bug in something you don’t fully understand?

Thanks,
Jason Minters

1 Like

I usually do this
Get repro steps from bug reporter. Talk to the user, find out what they were doing. They might have an intuition about the cause. (Maybe they were doing something naughty?)
Don’t have repro steps? Spend time and find repro steps. Hopefully it is a bug that can be replicated easily.
Worst case scenario is that it doesn’t happen on your system. Figure out if your system is the outlier or the other person’s system is the outlier.
Then try to figure out what is different about your system. Is it different hardware or different software?

OK, so now you have repro steps. Now start messing around until the bug goes away or changes. Make sure that you keep track of what you are messing with. Start removing complexity from the hardware and data. Did you try restarting the system? Got a Wacom tablet? Disconnect it! Disconnect everything but the mouse and keyboard! :smiley:

Now start simplifying the data. Start deleting certain parts of the data and then do repro steps. Hopefully at some point you notice that things get fixed or change. Both are good signs that you just removed something that was causing the issue.

Ultimately you have to act like a sleuth. You are trying to find that needle in the haystack. Sometimes you have to use debugging tools like depends.exe. If your data can be exported in ASCII format, export the data with the bug and without the bug, and use diff software to see what is different.
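If you don’t have a diff tool handy, the export comparison can be roughed out in Python with the standard `difflib` module. This is only a sketch: the function name and the sample export lines are made up for illustration, and real exports will likely need some normalising (timestamps, ordering) before the diff is readable.

```python
import difflib


def changed_lines(good_text: str, bad_text: str) -> list[str]:
    """Return only the lines that differ between two ASCII exports.

    Lines starting with "- " exist only in the good export,
    lines starting with "+ " exist only in the bad one.
    """
    delta = difflib.ndiff(good_text.splitlines(), bad_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop them.
    return [line for line in delta if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical exports of the same asset, without and with the bug.
    good = "mesh body\nmaterial skin\nweight 1.0"
    bad = "mesh body\nmaterial skin\nweight 0.0"
    for line in changed_lines(good, bad):
        print(line)
```

Anything that shows up in the output is a candidate for the cause; anything that doesn’t can be crossed off.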

I can go on and on. But basically whittle the setup and data down until you see a change.

And most importantly don’t get angry when it’s something dumb like a full hard drive :smiley:
Be patient! Let your curiosity drive you! Know that someone is dependent on you and your victory is going to matter to someone or even a whole team!

Good luck, y’all!

2 Likes

The last resort for fixing a bug or any problem is bisection. Remove half and see whether the bug is in this half or that. Then cut in half again and repeat. It’s like the fastest way to guess a number between 1 and 100: higher or lower than 50? Lower. Then, higher or lower than 25? And so on. This is the fastest way to narrow in on the problem.

This works for rigs too. Remove as much of the rig as possible to work on just the area you’re fixing. Remove dependencies and clutter from the troubleshooting scene.
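The halving idea can be sketched in a few lines of Python. This assumes exactly one item causes the bug and that the bug reproduces whenever that item is present; `find_culprit`, `has_bug` and the brush names are hypothetical stand-ins for whatever “load this subset and run the repro steps” means in your pipeline.

```python
from typing import Callable, Sequence


def find_culprit(items: Sequence[str],
                 has_bug: Callable[[Sequence[str]], bool]) -> str:
    """Halve the data until the single item that triggers the bug remains."""
    if not items:
        raise ValueError("need at least one item to search")
    candidates = list(items)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the bug.
        candidates = first if has_bug(first) else second
    return candidates[0]


if __name__ == "__main__":
    brushes = [f"brush_{i}" for i in range(100)]

    def cook_fails(subset: Sequence[str]) -> bool:
        # Stand-in for a real test such as "cook the map with only these".
        return "brush_73" in subset

    print(find_culprit(brushes, cook_fails))
```

With 100 items this needs 7 repro runs instead of up to 100, which matters a lot when each run is a slow cook or export.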

Swap and replace: To figure out why your TV isn’t streaming, swap out one piece of hardware or one part of the network at a time to get the next clue to investigate further.

5 Likes

This answer assumes that the bug/system isn’t completely new, but that the specific area, or like you said, the internals, are an unknown.

I personally find it the most useful to ‘list what I do know’, ‘start at the beginning’, and ‘eliminate as many variables as possible’.

I’ll go through an example to show what I mean. Let’s say that… the bug is that a shader is acting strangely, everything is looking too dark, and I don’t have any experience with the shader system at all. Like, I don’t even know what file to open.

‘List what I do know’:
This area is somewhat based in logic problems. How do you break the problem down to its most foundational levels?

  1. There is a shader that exists.
  2. There is a new bug in this system that (arguably) wasn’t there before, so something has changed.

‘Start at the beginning’:
This is where you start expanding on what you know by asking questions. What are the pathways you can see from what you know to the answer? This usually breaks down to whether there is someone, or some documentation, that can provide a clue. These can usually be mapped out directly in a diagram if you are a visual thinker.
Shader >> Who wrote shader? >> Did they change it recently? >> If yes, cool, they need to revert.
>> If not, continue finding clues.

  1. Do I know who the original author is? Would someone at the company know? Who would I ask? Usually it’s safe to start with a producer or manager.
  2. What file do I open? What are the things I’m capable of doing with the resources I have today? Can I glance through the folder structure or look through submission history to find something that seems related?

‘Eliminate as many variables as possible’
This one assumes that you have some clue from the previous questions, and I can break this one down into a simple checklist. Check the following:

  1. Has the content (or in this case the shader) changed? aka Look for the History.
  2. If not, what other systems tie into the content or system I found through my clues? aka Look for the connections.
  3. If yes, who made the change and why? aka Look for the reasons.

This skillset can honestly be practiced very simply as thought experiments on any topic, not even tech art related. My sink is full of water that isn’t going away. How would you proceed? Every time you go through the process of simplifying the problem space, the next problem gets easier to ‘break down’.

NOW, if the internals of the bug in question are completely opaque to you, and there is no possibility of learning about them, there is one additional layer to add: if you can’t see the inside, you have to figure out what to do with exclusively the inputs and outputs. That gets much trickier most of the time, but is essentially the same process as above, with a lot more experimentation rather than actually understanding the system. A bit more ‘jiggle the handle and see if it works’, after using the above to attempt to eliminate as many inputs and outputs as possible from the problem.

1 Like

I find following the scientific method works best. It’s critical to have a clean environment and reduce noise as much as possible. Test a theory and see if it works or not. Narrow down the results one small thing at a time. Even the worst bugs, like race conditions, can be explained with this process.

2 Likes

Agreed with a lot of what people have posted about narrowing the problem space, using the scientific method, deleting things to try to find what’s causing it, etc. A lot of this is intuitive and you pick up patterns over time after doing this over and over.

One thing I haven’t seen listed yet is finding something that does work and figuring out what’s different about it. If one character has the bug, and another one doesn’t, what is different about them? Start comparing every property to figure out what could be different about them that could be causing the bug. You can do the same thing with shaders, meshes, etc.

I like this method because it helps you understand the scope of the problem. If all things are broken, it points to a systemic issue like a code change; if only one thing is broken, it’s most likely a data issue. This is a quick way to narrow down the root cause of an issue.
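As a rough illustration of the “compare something that works against something that doesn’t” idea, here is a small Python sketch. It assumes you can dump each asset’s properties into a flat dictionary; the property names and the `MISSING` marker are invented for the example.

```python
MISSING = "<missing>"  # marker for a property that one side does not have


def diff_properties(working: dict, broken: dict) -> dict:
    """Map each differing property name to its (working, broken) values."""
    names = sorted(working.keys() | broken.keys())
    return {
        name: (working.get(name, MISSING), broken.get(name, MISSING))
        for name in names
        if working.get(name, MISSING) != broken.get(name, MISSING)
    }


if __name__ == "__main__":
    # Hypothetical property dumps for a good and a bad character.
    good = {"shader": "skin", "lod_count": 3, "casts_shadow": True}
    bad = {"shader": "skin", "lod_count": 3, "casts_shadow": False,
           "legacy_flag": True}
    for name, (ok_value, bad_value) in diff_properties(good, bad).items():
        print(f"{name}: working={ok_value!r} broken={bad_value!r}")
```

The short list of differences becomes your list of suspects, and each one can then be flipped on the broken asset to see whether the bug follows it.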

One common mistake I see people making is going right to some obscure root cause that they encountered once or heard about. You have to factor in the odds of something occurring. It is most likely not an operating system issue, or a hardware issue, or due to some piece of code that hasn’t been touched in 10 years. Start with the more likely causes, user error :stuck_out_tongue: , and rule those out first before you get to the more obscure ones.

2 Likes

I’m sure I’m gonna be beating a dead horse at this point, as everyone above has had such thoughtful comments. Some of y’all sparked memories of problems I’ve had in the past and how I solved them (@Count_Zr0 reminded me of a time I had an issue with a map cooker and in order to isolate the one bit of errant BSP I had to delete half the map, cook, and repeat until I found the problem area).

Maybe I spent too much time watching House, but for me this sort of thing usually comes down to doing everything I can to isolate the problem. If I’m trying to solve an issue that’s been brought up by a user, I’ll often get an incomplete cause (e.g. “The exporter breaks every time I export”, even though they only ever export from one specific file, and that file may in fact contain a content error).

One of the things I really struggle with here is knowing when to switch into reverse and back up down a path I’ve taken to solve a problem. I ran into this just this week, actually. “When I do X, the draw distances of everything in my scene is reduced to 10% of what it should be”. Step 1 should usually be “can I reproduce this issue with the information provided in a completely clean environment”. The answer in this case was no, but I kept focusing on X without realizing how X interacted with Y system. It took me a while before I realized I’d gotten too deep down a bad path. I have to constantly remind myself to come up for air and re-evaluate my approach, especially when I’m in the weeds on something that I know is frustrating me. Sometimes I’ll even write down what I’m doing and what I’m testing, or rubber duck it (now that I’m typing it out, I’m reminding myself that I should do this more than I actually do)

If I’m trying to solve a bug with something and I have access to the source code, I often find it helpful to use whatever debuggers are available to try and find where Thing is happening (or not, as the case may be), and glean what I can from the available data. I’m not above throwing in dozens of print statements to see what’s happening, either.

At the end of the day, we, as tech artists, deal with a lot of black boxes, and those are hard, especially when the person who wrote the black box doesn’t work at the company any more. I have to remind myself sometimes that “this was a good idea to someone at some point”, which helps me get in the right mindset to try and solve those kinds of problems.

At the end of the day, though, I think it comes down to making hypotheses, testing them in repeatable ways, and gathering meaningful data from those tests.

1 Like

Print statements are your friend. I use a type of binary search to find bugs as fast as possible. I start in the middle of the code (which can be complex to find in hierarchies), and then if it prints, I add another in the middle of the part after; otherwise I add one in the middle of the part before.

The other, slower (but sometimes necessary) way is just to test the simplest and least complex piece, and then work your way up from there. Once you run into the problem, you will know what part it is dependent on.
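The print-statement binary search might look something like this if the code can be treated as a list of steps. It’s a sketch under a few assumptions: the state is fine before any step runs, it is bad after all of them, and once it goes bad it stays bad. `first_bad_step` and `looks_ok` are hypothetical names.

```python
from typing import Any, Callable, Sequence


def first_bad_step(steps: Sequence[Callable[[Any], Any]],
                   initial: Any,
                   looks_ok: Callable[[Any], bool]) -> int:
    """Binary-search for the index of the first step that breaks the state."""

    def state_after(count: int) -> Any:
        # Re-run the first `count` steps from a clean start.
        state = initial
        for step in steps[:count]:
            state = step(state)
        return state

    good, bad = 0, len(steps)  # ok after `good` steps, broken after `bad`
    while bad - good > 1:
        mid = (good + bad) // 2
        state = state_after(mid)
        print(f"after {mid} steps: {state!r}")  # the print-statement probe
        if looks_ok(state):
            good = mid
        else:
            bad = mid
    return bad - 1


if __name__ == "__main__":
    # Toy pipeline: the value should never go negative.
    pipeline = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 10,
                lambda x: x * 3, lambda x: x + 5]
    print(first_bad_step(pipeline, 1, lambda x: x >= 0))
```

In real code the “steps” are usually just places where you can drop a print, but the bookkeeping is the same: keep one known-good point and one known-bad point, and probe halfway between them.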

Good luck!

2 Likes

What I normally do is first break down the possible systems that are all interacting at that point in time. From there I start to narrow the possibilities as to what could be causing it, slowly working backward through the system and eliminating anything that isn’t affecting it. Once I’ve narrowed it down to the possible causes, it’s just a matter of finding out the uses of those systems, and then it’s problem solving and testing from there. Also, it took me a long time to learn to just ask for help when I needed it. Coworkers are an invaluable resource when it comes to unknown systems; there is a lot of legacy/tribal knowledge floating around out there that someone might know.

1 Like

So, I figured this would be a good thing to actually break down since it’s literally happening to me in real-time. Here goes. I have some passing familiarity with the system in question, but I didn’t write it, nor have I had to do any maintenance on it as of yet.

1:14pm, I get a report that when scrubbing through Time of Day using the scroll wheel in the Developer Commentary game mode the screen gets all red n’ blue. It appears that the height fog is bluish, and static geo is reddish.

Other meetings, yadda yadda yadda

5:15pm - I can get back on to the problem. What I know right now is that it has something to do with the time of day system. I know our TOD system works by lerping between a BUNCH of values set at keyframes (keyframes are set in minutes between 0 and 1440). I know that we’ve got a modulo in place to prevent time values less than 0 and greater than 24 from happening. I add some text components to the time of day actor to see if I can see around when this weird shift occurs.
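For anyone who hasn’t worked with a system like this, here is a minimal Python sketch of keyframe lerping with wrap-around. It is not the actual Time of Day blueprint, just an illustration of the idea; the keyframe values are invented and `sample` is a hypothetical name.

```python
MINUTES_PER_DAY = 1440


def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def sample(keyframes: dict[int, float], minute: float) -> float:
    """Interpolate a value at `minute`, wrapping from the last key to the first."""
    minute %= MINUTES_PER_DAY  # keep the time inside one day
    times = sorted(keyframes)
    # Latest key at or before `minute`; before the first key, wrap to the last.
    prev_key = max((t for t in times if t <= minute), default=times[-1])
    # Earliest key after `minute`; after the last key, wrap to the first.
    next_key = min((t for t in times if t > minute), default=times[0])
    span = (next_key - prev_key) % MINUTES_PER_DAY or MINUTES_PER_DAY
    elapsed = (minute - prev_key) % MINUTES_PER_DAY
    return lerp(keyframes[prev_key], keyframes[next_key], elapsed / span)


if __name__ == "__main__":
    # Made-up weights keyed by minute of day.
    weights = {480: 1.0, 720: 1.0, 1110: 0.0, 1170: 0.0}
    for minute in (480, 915, 1170, 1200):
        print(minute, sample(weights, minute))
```

The interesting part is the stretch after the last key: there is no later key that day, so the blend target has to be the first key of the next morning, and the span has to be measured across midnight.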

5:35pm - I can see that the issue occurs right around 1170 (again, just minutes, so this equates to about 7:30pm). Interestingly, this is the highest value keyframe we have! Next step: to what key is this trying to blend?

(I go check on my wife, who is cooking dinner, and grab a fizzywater)

6:15pm - I add some watch values, and keep the blueprint window open while I run developer commentary mode again and scroll until the issue occurs. Next key is in the morning, at 480. Fantastic! It’s not trying to get some key that doesn’t exist and the system is sufficiently robust as to wrap around.

6:34pm - By now I have managed to reproduce the issue in the editor by changing the time value from 19.0 to 20.0. I can toggle back and forth really quickly, so now my iteration loops are shorter. I know that the system affects the Directional Light, Sky Atmosphere, Exponential Height Fog, Post Process, and Skylight. I start turning each of those off and on until I get the issue to NOT occur. Nothing! I watch various values on those actors change as I flip back and forth, I reset the values to default. Nothing!

I look back at the Blueprint and notice an “Overlay PostProcess”. Weird, where is that actor? Not an actor! An unbound component on the Time of Day itself! New thing to watch, flip back and forth, disable the component. No more red and blue! Sweet, getting closer.

I expand a bunch of values in the postprocess settings, flip back and forth, toggle stuff until the issue stops. I notice that the blend weight for a postprocess material flips from 0.0 to 1.0 as I flip from 19.0 to 20.0. I open it up, and plug postprocessinput0 into emissive to nullify the material. No more red and blue! It wasn’t fog!

6:40pm - I start writing this post (dinner is starting to smell really good).

6:50pm - At this point I know that it’s SOMETHING in the postprocess material that adds a tint from near to far (which is why I thought it was the fog, stuff at a distance appeared blue). The background and foreground values don’t appear to change, or to be terribly extreme, but the background is vaguely blue, and the foreground is vaguely red.

Do the values on the postprocess material instance change? No. Distance, falloff, and the colors do not change.

Crack open the parent material, observe that the background and foreground are also extremely blue and red. This doesn’t change with Time of Day. (oh, the default values are more saturated, doi)

What happens if I change the color values on the mat instance?

Nothing, I switched both foreground and background to green and it’s still Red -> Blue.

Back up, double-check to see if there’s anything about creating a dynamic instance, or if there are settings for this material stored in the keyframes. Nope.

Changing the distance does work. Falloff doesn’t. Revert the mat instance.

(Wife brought dinner, it is delicious)

Back to the material. It uses a Blend_Overlay to drop the color onto scene color. I know from years of Photoshop that Overlay does weird stuff when the base layer is dark. Swap it for a multiply. Colors are still super saturated, but it changed something, so at least (I think) I know I’m changing the right thing.

Swap the multiply for a Lerp(SceneColor, OverlaidSceneColor, MyWeightValue).
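To show why Overlay on a dark base worried me, here is the common Photoshop-style overlay formula next to a weight lerp, per channel, in plain Python. I’m assuming Blend_Overlay behaves like this textbook formula; the `tint` helper and the color values are purely illustrative.

```python
def overlay(base: float, blend: float) -> float:
    """Photoshop-style overlay for one channel, values in the 0..1 range."""
    if base < 0.5:
        return 2 * base * blend  # dark base: behaves like a doubled multiply
    return 1 - 2 * (1 - base) * (1 - blend)  # bright base: like a screen


def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def tint(scene_rgb, tint_rgb, weight):
    """Overlay tint_rgb onto scene_rgb, then fade the result in by weight."""
    overlaid = [overlay(s, c) for s, c in zip(scene_rgb, tint_rgb)]
    return [lerp(s, o, weight) for s, o in zip(scene_rgb, overlaid)]


if __name__ == "__main__":
    dark_grey = [0.2, 0.2, 0.2]
    pure_red = [1.0, 0.0, 0.0]
    print(tint(dark_grey, pure_red, 0.0))  # weight 0 leaves the scene alone
    print(tint(dark_grey, pure_red, 1.0))  # weight 1 applies the full overlay
```

On a dark pixel the overlay with a saturated color zeroes out the other channels entirely, so a fully weighted tint reads as a hard color cast rather than a subtle grade.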

Colors look a lot better at Weight = 0, so at Weight = 1 they should be fully saturated, right?

WRONG.

WHAT!?

Back up a second… SceneTexture:PostProcessInput0 outputs a float4, and the lerped depth for scene color is a float3. You can’t lerp or multiply a float3 with a float4, so I had to use a component mask to get the arithmetic to work right. Plug the .rgb input into the blend overlay and skip my manual weight lerp altogether. Super saturated values in the material viewport (expected). Super saturated values in the game viewport as well. Dangit.

Re-enable the weight lerp, but with a default value of 1. Same thing, super saturated. I don’t like this one bit.

Weight = 0 by default, override in the instance to 1, desired results.

Back up, verify that the depth lerp is working as expected, output that straight to emissive. Super saturated in the game viewport, appropriately desaturated in the instance viewport. Oh no…

Change the default color values. That works. Change them in the instance? No effect in the game viewport.

Reset the material to default, have a think. At a time value of 20, the scene postprocess volume currently shows the blend weight of the PP_Depth_Inst to be 0, the TOD’s PostProcess component shows a weight of 1. I can’t change this. The overall blend weight of the TOD postprocess component is 0.04.

(It is at this point, I begin to consider that there may be a Bug afoot)

Is it possible that the postprocess material array isn’t referencing the instance at all, but is in fact referencing the base material? Clear the reference and reapply it. I cannot! I suspect I’m blocked by the TOD update script. Go check the settings, since postprocess blend weights and materials are stored in the keyframes.

I can nuke the material from those keyframes and reapply, I see this update on the component. This does not improve the situation. I change the post process material’s blend weight to 0.5, nothing. 0.01? Nothing. 0.0, good effect. Start advancing through the time of day cycle.

H*ck, now it’s blue at noon. Start checking the PPM arrays at all the other times of day, notice that in some keyframes the ORDER is different. Fix that, set the weights back where they should be. (Can’t hurt, right?) Cycle through the day. Back to square 1.

Okay, so at 1110 and 1170 the blend weights are 0 but are 1 at each of the other keyframes. Maybe something’s wrong there, set those to 1.

That fixed it! And it didn’t seem to have a negative effect on the times between 1110 and 1170.

WHY DID THAT FIX IT?!

Okay, go back, change the values in the base material to the instance’s values and revert the blend weights to 0.0 for 1110 and 1170. Also solved.

At this point, I am thinking that I have encountered A Bug. Is the bug in the Time of Day script or in the postprocess… stuff? At this point, I believe I’ve found a Content solution (set the default material values to the desaturated ones that are more desirable), I can comfortably move ahead with it, and I’ll try to isolate the specific issue at a later date. My hypothesis is something like “when the postprocess material blend weight = 0.0, the system uses the parent instead”.

Test the time of day scrubbing in developer commentary mode, looks good to go.

Run through the level one more time just to make sure I didn’t goof anything.

I ALMOST FORGOT TO DISABLE MY DEBUG ACTOR.

Phew, what a ride.

2 Likes

Some really great suggestions here! I read a book on debugging once and some of the advice in there has served me well over the years.

Many of the approaches I use have been mentioned but here are some rules I live by which can be applied to almost any problem:

  1. Reproduce the problem (make the system fail)
  2. Challenge your assumptions (“print that variable”, “check it’s plugged in”)
  3. Simplify the environment (turn things off and isolate the problem)
  4. If all else fails, binary search
  5. If you didn’t find the problem, it’s not fixed. (seriously!)
1 Like