Fast search through text file


#81

line, lines all around, feel bad at naming things :slight_smile:

        public class Line
		{
			public int count;
			public List<string> lines;
			public List<int> indexes;
			
			public Line()
			{
				count = 0;
				lines   = new List<string>(){};
				indexes = new List<int>();
			}
			
			
			public Line( string line, int index )
			{
				count   = 1;
				lines   = new List<string>(){ line };
				indexes = new List<int>( index );
			}
			
			public void AddLine( string line, int index )
			{
				lines.Add( line );
				indexes.Add( index );
				count++;
			}
		}

		public class TextProcessor
		{
			public Dictionary<string, Line> data;
			public string[] keys;
			public Line[] lines;
			
			public void ProcessFile(string file , string[] words)
			{	
				data = new Dictionary<string, Line>();
				
				var spaces = new char[]{' ','	','\r','\n'};
				string[] lines = File.ReadAllLines(file);
				if (lines != null)
				{
					int line_index = 0;
					
					string word;
					
					foreach(string line in lines)
					{
					 	line_index++;
						
						// #1
						//word = line.TrimStart(spaces).Split()[0].ToLower();
						
						// #2
						word = line.TrimStart(spaces);
						int len  = word.IndexOfAny( spaces );
						word = word.Substring( 0, len < 0 ? word.Length : len ).ToLower();
						
						// #3
						//var chars = line.ToCharArray();						
						//int f1 = Array.FindIndex(chars,     x => !char.IsWhiteSpace(x));						
						//if ( f1 < 0 ) continue;						
						//int f2 = Array.FindIndex(chars, f1, x => char.IsWhiteSpace(x));
						//word = line.Substring( f1, f2 - f1 ).ToLower();
						
												
						if (data.ContainsKey(word))
						{
							data[word].AddLine( line, line_index );
						}
						else
						{
							data[word] = new Line( line, line_index );						
						}
					}
				}
				keys = new string[data.Keys.Count];
				data.Keys.CopyTo(keys, 0);				
				this.lines = new Line[data.Values.Count];
				data.Values.CopyTo(this.lines, 0);
			}
			
		}

#82

Strange. There are fewer indexes than the other collected data. The first line index is not collected.

tp.keys[1]: beta

tp.lines[1].count: 1856

tp.lines[1].indexes.count: 1855

tp.lines[1].lines.count: 1856

I am using this inside your code to filter only the passed words:

if (Array.IndexOf(words, word) > -1)
{
	if (data.ContainsKey(word))
	{
		data[word].AddLine( line, line_index );
	}
	else
	{
		data[word] = new Line( line, line_index );
	}
}

#83

don’t do this.
instead initialize the dict with known words and add lines info only in case of data.ContainsKey(word)

...
foreach( var w in words ) data[w] = new Line(); // preinitializing the dict
...
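A minimal sketch of the preinitialized-dict idea, not the code from the thread: `PreinitSketch` and `Collect` are hypothetical names, and the `Line` class is replaced by a plain `List<int>` of line numbers to keep it short. It assumes the lookup should be case-insensitive by lowercasing both sides, as the `ToLower()` calls above suggest.

```csharp
using System;
using System.Collections.Generic;

public static class PreinitSketch
{
    // Hypothetical helper: collects 1-based line numbers per first word,
    // but only for the words that were requested up front.
    public static Dictionary<string, List<int>> Collect(IEnumerable<string> lines, string[] words)
    {
        var data = new Dictionary<string, List<int>>();
        foreach (var w in words) data[w.ToLower()] = new List<int>(); // preinitialize

        int lineIndex = 0;
        foreach (var line in lines)
        {
            lineIndex++;
            // Same first-word extraction as variant #1.5 in this thread;
            // a blank line yields "" which is simply not found below.
            string word = line.TrimStart().Split((char[])null, 2)[0].ToLower();

            // TryGetValue does one lookup instead of ContainsKey + indexer.
            List<int> hits;
            if (data.TryGetValue(word, out hits)) hits.Add(lineIndex);
        }
        return data;
    }
}
```

With this shape there is no else branch at all: words that were not requested never get a key, so nothing has to be filtered afterwards.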

#84

I still don’t understand why that’s the case, but changing to this solves the issue
upd
found it at last, was putting index inside parens and not inside curlies

            public Line( string line, int index )
			{
				count   = 1;
				lines   = new List<string>(){ line };
				indexes = new List<int>(){ index };
			}
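For anyone hitting the same thing, a small sketch of the difference (the class and method names are made up for illustration): with parentheses the int is passed to the `List<int>(int capacity)` constructor, so the list starts empty; with curly braces it is a collection initializer and the value is added as the first element. That is why indexes.count was one less than lines.count.

```csharp
using System;
using System.Collections.Generic;

public static class ListInitSketch
{
    // Parentheses: calls List<int>(int capacity). The list is EMPTY,
    // it only reserves room for `index` elements.
    public static List<int> WithParens(int index)
    {
        return new List<int>(index);
    }

    // Curly braces: collection initializer. The list contains `index`.
    public static List<int> WithBraces(int index)
    {
        return new List<int> { index };
    }
}
```

`ListInitSketch.WithParens(5).Count` is 0 while `ListInitSketch.WithBraces(5).Count` is 1. The parens version also throws ArgumentOutOfRangeException for a negative value, since a capacity cannot be negative.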

#85

This way works. Thank you.

public void ProcessFile(string file , string[] words)
			{	
				data = new Dictionary<string, Line>();				
				foreach( var w in words ) data[w] = new Line();

#86

So, which method is preferable - to preinitialize the dict or to use your last update?


#87

depends on how you plan to use it.
if you’re sure that words that aren’t in the list shouldn’t be collected, why add them? go with the preinitialized dict
it is all about the performance
if there isn’t much difference, collect everything and then retrieve from the dict by the key; then of course you’ll have to keep the else statement


just to be clear, if you go with the predefined dict you won’t have the ---- key in the dict after you process the file.
.

upd
this one, must be the fastest version so far. ~45ms

code
public class Line
{
	public int count = 0;
	public List<string> lines;
	public List<int> indexes;
	
	public Line()
	{
		count = 0;
		lines   = new List<string>(){};
		indexes = new List<int>();
	}
	
	
	public Line( string line, int index )
	{
		count   = 1;
		lines   = new List<string>(){ line };
		indexes = new List<int>(){ index };
	}
	
	public void AddLine( string line, int index )
	{
		lines.Add( line );
		indexes.Add( index );
		count++;
	}
}

public class TextProcessor
{
	public Dictionary<string, Line> data;
	public string[] keys;
	public Line[] lines;
	
	public void ProcessFile(string file , string[] words)
	{	
		data = new Dictionary<string, Line>();
		
		//foreach( var w in words) data[w] = new Line();
		
		var spaces = new char[]{' ','	','\r','\n'};
		string[] lines = File.ReadAllLines(file);

		if (lines != null)
		{
			int line_index = 0;
			
			string word;
			
			foreach(string line in lines)
			{
				line_index++;
				
				// #1
				//word = line.TrimStart(spaces).Split()[0].ToLower();
				
				// #2
				//word = line.TrimStart(spaces);
				//int len  = word.IndexOfAny( spaces );
				//word = word.Substring( 0, len < 0 ? word.Length : len ).ToLower();
				
				// #3
				//var chars = line.ToCharArray();						
				//int f1 = Array.FindIndex(chars,     x => !char.IsWhiteSpace(x));						
				//if ( f1 < 0 ) continue;						
				//int f2 = Array.FindIndex(chars, f1, x => char.IsWhiteSpace(x));
				//word = line.Substring( f1, f2 - f1 ).ToLower();
				
				// #4						
				int f1 = -1;
				for ( int i = 0; i < line.Length; i++ )
				{
					if ( !char.IsWhiteSpace(line[i]) )
					{
						f1 = i;
						break;
					}
				}
				
				if ( f1 < 0 ) continue;
				int f2 = -1;
				for ( int i = f1; i < line.Length; i++ )
				{
					if ( char.IsWhiteSpace(line[i]) )
					{
						f2 = i;
						break;
					}
				}
				word = line.Substring( f1, f2 - f1 ).ToLower();
										
				if (data.ContainsKey(word))
				{
					data[word].AddLine( line, line_index );							
				}
				else
				{
					data[word] = new Line( line, line_index );							
				}
			}
		}
		keys = new string[data.Keys.Count];
		data.Keys.CopyTo(keys, 0);				
		this.lines = new Line[data.Values.Count];
		data.Values.CopyTo(this.lines, 0);
	}
	
}

#88
// #1.5
var word = line.TrimStart().Split(null, 2)[0].ToLower();

#89

Yep, the last one is the fastest.

When I make it work with a string, for some files there is an error:
– Runtime error: .NET runtime exception: Length cannot be less than zero.

Found why. If a line starts with:

/

or with:
----
there is an error.

The same code here works with no errors: https://dotnetfiddle.net/9KqZve

The times I have with different versions:

  1. pure maxscript: 0.68 sec
  2. first C# version with regEx: 0.33 sec.
  3. last C# version: 0.26 sec – including time to fill maxscript arrays with the data. Only C# code is 0.05 sec

#90

well, I doubt that this test file that you provided is a good test case to ensure that the solution is bug-free.

change this line like that and it will work
int f2 = line.Length;
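A sketch of why that one line matters, with the #4 loops wrapped in a hypothetical `FirstWord` helper (the helper name and the null return for blank lines are assumptions, not part of the original code). When the line has no whitespace after its first word, e.g. a line that is just ---- or /, the second loop never assigns f2; with f2 = -1 the expression f2 - f1 goes negative and Substring throws "Length cannot be less than zero". Starting from line.Length makes the word run to the end of the line instead.

```csharp
using System;

public static class FirstWordSketch
{
    // Hypothetical wrapper around variant #4; returns null for blank lines.
    public static string FirstWord(string line)
    {
        int f1 = -1;
        for (int i = 0; i < line.Length; i++)
        {
            if (!char.IsWhiteSpace(line[i])) { f1 = i; break; }
        }
        if (f1 < 0) return null; // nothing but whitespace

        int f2 = line.Length; // was -1: broke when no whitespace follows the word
        for (int i = f1; i < line.Length; i++)
        {
            if (char.IsWhiteSpace(line[i])) { f2 = i; break; }
        }
        return line.Substring(f1, f2 - f1).ToLower();
    }
}
```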

.

nah, still two times slower. ~100ms


#91

Fixed, thank you.

For now everything works as expected - fast and accurate.


#92

cool, perhaps there’s nothing left to optimize
even the latest .NET 7 version shows 28ms, which isn’t far from the 40+ms on my device

curious, what is it all for? some kind of obfuscator / source file analyzer?


#93

what a bore you are! :stuck_out_tongue_winking_eye:

			public static string GetFirstWord(string line)
			{
				int i = 0;
				while (i < line.Length && char.IsWhiteSpace(line, i)) { i++; }
				int k = i;
				while (k < line.Length && !char.IsWhiteSpace(line, k)) { k++; }
				return line.Substring(i, k-i);
			}

btw… on my machine the difference for pure searching and using trim and split is 25 vs 40. So it’s not a big deal. :wink:


#94

shouldn’t have said that :rofl:

nice, sometimes it drops even below 40ms


#95

To make navigating in scripts more user friendly. :slight_smile:
You know that Ctrl+RMB click gives you a menu where you can see the controls, functions, events, etc. used in the current script, and you can go wherever you want by clicking the desired item.
When I see the whole screen (literally) full of items to click, and there are more items not shown… it is not an easy task to find what you need.
So, I created this:

For the currently opened script, the [ </> ] shows buttons for all available controls + local and global vars + struct names + functions + lines where I have format and print. Clicking a button shows the data. Clicking a row in the list shows the line number of the selected text; double-clicking a row selects the same line in the MaxScript editor and makes it visible, so navigation is easy and fast.
The button with the bookmark icon shows all bookmarks for the current script. The navigation is the same - click a row in the list to go to the desired bookmark.
The [ /* ] button shows all comments which start with --. The same way of navigation.
With the filter box I can find what I need much faster.
And most importantly, with your help and the help of Denis, collecting the data is much faster than Ctrl+RMB click inside the MaxScript Editor.

What I have in my ToDo list:

  • collect variables inside structs
  • find a faster way to populate the listview. It takes more than 1.5 sec to fill it with 4000+ items.

#96

looks really good. must be very convenient to use for anyone who is too lazy to switch to vscode or similar.

listview supports virtualization, maybe that’s the way to go. (never used it myself)

did you mean struct properties?
it shouldn’t be a big deal: find a struct definition, then find an opening ( and closing ) and tokenize everything that’s inside that range

code

just the idea

...
re_options_m = dotNet.combineEnums (dotNetClass "System.Text.RegularExpressions.RegexOptions").MultiLine (dotNetClass "System.Text.RegularExpressions.RegexOptions").IgnoreCase,
	
...
				
--  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
--		2. Collect struct defs
--  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
(
local pattern_struct = "\bstruct\s(\w+)\b"
matches = (dotNetClass "System.Text.RegularExpressions.RegEx").Matches code pattern_struct re_options_m

for i = 0 to matches.count - 1 do 
(
	local item        = matches.item[i].groups.item[1]
	local match_start = 1 + item.index
	local match_end   = match_start + item.value.count
	
	if valid[ match_start ] and valid[ match_end ] do
	(
		append STRUCT_TOKENS ( Token type:"struct" value:item.value start:match_start end:match_end )				
	)			
)
)

#97

Thank you.
I use vscode for Python and PowerShell, but I can’t force myself to use it (along with notepad++) for maxscript. :slight_smile:

Thank you. Will check it in the coming days. First I have to “fix and arrange” the code of the whole script.

Yes. The functions defined inside structs are collected by the script, but the variables do not start with local or global, so they are not collected.


#98

but they go either after a comma , or the struct opening ( and before the assignment = or the next comma or the closing ) parenthesis. Doesn’t look so complicated
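A rough C# sketch of that rule, only to show the shape of it. Everything here is hypothetical (`StructMemberSketch`, `FindClosingParen`, `MemberNames`), and it assumes the struct body contains no comments or string literals with parentheses or commas in them; a real version would have to skip those first. Anything whose left-hand side contains whitespace (fn definitions, event handlers, public/private prefixes) is simply ignored.

```csharp
using System;
using System.Collections.Generic;

public static class StructMemberSketch
{
    // Returns the index of the ')' that matches the '(' at openIndex, or -1.
    public static int FindClosingParen(string code, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < code.Length; i++)
        {
            if (code[i] == '(') depth++;
            else if (code[i] == ')' && --depth == 0) return i;
        }
        return -1;
    }

    // Splits the struct body on top-level commas and keeps the text before '='.
    public static List<string> MemberNames(string code, int openIndex)
    {
        var names = new List<string>();
        int close = FindClosingParen(code, openIndex);
        if (close < 0) return names;

        var blanks = new[] { ' ', '\t', '\r', '\n' };
        int depth = 0, start = openIndex + 1;
        for (int i = openIndex + 1; i <= close; i++)
        {
            char c = code[i];
            if (c == '(') depth++;
            else if (c == ')' && i != close) depth--;

            // An item ends at a comma outside nested parens or at the closing paren.
            if ((c == ',' && depth == 0) || i == close)
            {
                string item = code.Substring(start, i - start);
                int eq = item.IndexOf('=');
                string name = (eq < 0 ? item : item.Substring(0, eq)).Trim();
                // Plain members are a single token; "fn foo a b" etc. are skipped.
                if (name.Length > 0 && name.IndexOfAny(blanks) < 0) names.Add(name);
                start = i + 1;
            }
        }
        return names;
    }
}
```

The opening paren index would come from searching forward from the end of the struct name match that the regex pass already finds.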


#99

TrimStart of course

var word = line.TrimStart().Split(null, 2)[0].ToLower();

#100

Strangely, I see no difference at all: both trim variants complete at around ~100ms, of which 26ms is the file read
If it is the fastest for you, then I guess it could be either CLR version differences (compiler optimizations) or the hardware, i.e. modern cpu instructions etc.

trim vs getfirstword

trim
[image]
getfirstword
[image]