Fast search through text file


#1

Hi, guys and Happy New Year! :wink:

I am trying to make this task as fast as possible: finding words in a string (the contents of a text file).
This code snippet generates a string (imitating a text file) with 50 000 lines (the number is correct).

(
	gc()
	newLineStr = "\n"
	newTabStr = "\t "	
	--( 	generate string
	space = " "
	nl = "\n"
	tab = "\t"
	tabNl = "\t\n"
	fmt = "%\n"
	
	wordsArr = #("alpha", "beta", "gama", "delta", "Epsilon", "Zeta", "Eta", "Theta", "Iota", "kaPPa", "LamBda", "mU", "Nu", "xi", "omicron", "pi", "rHo", "siGma", "Tau", "UpSiLoN", "pHi", "chi", "psi", "omega")
	ss = stringStream ""
	for i = 1 to 50000 do
	(
		wordsCnt = random 10 20
		wArr = for j = 1 to wordsCnt collect (wordsArr[random 1 24])
		tabCnt = random 0 5
		
		str = ""
		if mod i 17 == 0 then
			str = tabNl
		else
		(
			if mod i 33 == 0 then
				str = nl
			else
			(
				for t = 1 to tabCnt do str += tab
				
				for w in wArr do str += space + w
			)
		)
		
		format fmt str to:ss
	)
	--)
	
	
	strToCheck = toLower (ss as string)
		
	gc()
	t0 = timestamp()
	
	fStrArr = (filterString strToCheck newLineStr)
	
	t1 = timestamp()
	format "filterString %  sec.\n" ((t1-t0)/1000.0)
	
	format "Lines: %\n" fStrArr.count
	
	
	_kappa = "kappa"
	_omicron = "omicron"
	_upsilon = "upsilon"	
	wordToFindArr = #(_kappa, _omicron, _upsilon)	
	
	
	
	kappaArr = #()
	omicronArr = #()
	upsilonArr = #()
	
	gc()
	t0 = timestamp()
	
	for j = 1 to fStrArr.count where (stringArr = (filterString fStrArr[j] newTabStr)).count != 0 do
	(
		str1 = stringArr[1]
		stopLoop = false				
		for i in wordToFindArr where i == str1 while stopLoop == false do
		(
			case str1 of
			(
				--	"collect line number"
				_kappa:
				(
					append kappaArr j
					stopLoop = true
				)
				--	"collect the line text"
				_omicron:
				(
					append omicronArr (trimLeft fStrArr[j] space)
					stopLoop = true
				)
				--	"collect the line text"
				_upsilon:
				(
					append upsilonArr (trimLeft fStrArr[j] space)
					stopLoop = true
				)
			)
		)
	)
	
	t1 = timestamp()
	format "Find words %  sec.\n" ((t1-t0)/1000.0)
	
	format "kappaArr: %\n" kappaArr.count
	format "omicronArr: %\n" omicronArr.count
	format "upsilonArr: %\n" upsilonArr.count
)

You can see what the script has to do: find each line of strToCheck that starts with one of the predefined words and collect some data.

With the code above I have:

Find words 0.285 sec.

If I remove the case str1 of statement, the time is almost the same.

With the real text file I am using, finding all the predefined words takes about 0.7 sec.

If I use only Python (outside 3ds Max), the same task takes about 0.18 sec.

Is there any way to make MAXScript perform the task faster?
The collected data will be used to fill a dotnet ListView.


#2

fastest way would be to use regex in multiline mode, so you don’t have to split the text into separate lines.
if you have to iterate over each line anyway, then use the default (non-multiline) mode and check whether each line matches the pattern with the Match method.

there were some examples posted on the forum
if you need to combine several regex options, you can do it with dotnet.combineEnums

in your case the pattern for multiline mode would look like this:
regex_pattern = "^(kappa|omicron|upsilon)\b"
where ^ requires the match to be at the beginning of a line, | is simply the OR operator, and \b is a word boundary. So any line starting with one of these words will produce a match
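For illustration, here is a minimal C# sketch of that multiline pattern (C# because the thread later moves the work there). The sample text, the class name MultilineDemo and the method LeadingWords are made up for the example; only Regex, RegexOptions.Multiline and Match.Groups are .NET API. The capture group tells which of the three words the line starts with.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // ^ anchors at every line start because of RegexOptions.Multiline;
    // \b stops "upsilon" from matching a longer word such as "upsilons".
    static readonly Regex Pattern = new Regex(
        @"^(kappa|omicron|upsilon)\b", RegexOptions.Multiline);

    // Returns the leading keyword of every line that starts with one.
    public static List<string> LeadingWords(string text)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(text))
            found.Add(m.Groups[1].Value);
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample; lines are separated by "\n" only.
        string text = "kappa alpha\nbeta kappa\nomicron pi\nupsilons nu\nupsilon xi";
        Console.WriteLine(string.Join(",", LeadingWords(text)));
    }
}
```

On this sample the sketch should report kappa, omicron and upsilon, skipping the line that starts with "upsilons" because of \b. The pattern assumes each line starts directly with the word; leading whitespace is not handled here.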

I’d love to post an mxs example, but my laptop’s screen suddenly broke and this device has no max installed


#3

Thank you, Serejah!

I have spent several hours searching the net for a way to get the line number (and the whole line’s text) when regex is used, but I had no success. In some cases I need to know which line the words are on, and I could not find how to do this with regex.


#4

in this case you’re probably better off iterating over the lines array, so the line number is always known
although SO has some answers for a similar task
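One way to recover the line number from a multiline regex match, sketched in C# under a few assumptions: lines are separated by "\n", numbering is 1-based, and empty lines are counted. As far as I know filterString drops empty tokens by default, so these numbers will not line up with fStrArr indices. LineNumberDemo and Find are hypothetical names. The idea is to count the newlines that precede Match.Index and to read the line text up to the next newline.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineNumberDemo
{
    // [ \t]* allows leading tabs and spaces but cannot cross a newline,
    // so every match starts exactly at the beginning of its line.
    static readonly Regex Pattern = new Regex(
        @"^[ \t]*(kappa|omicron|upsilon)\b", RegexOptions.Multiline);

    // Returns (1-based line number, full line text) for every matching line.
    public static List<KeyValuePair<int, string>> Find(string text)
    {
        var hits = new List<KeyValuePair<int, string>>();
        int line = 1;      // line number of the position 'scanned'
        int scanned = 0;   // everything before this index is already counted
        foreach (Match m in Pattern.Matches(text))
        {
            // Count the newlines between the previous match and this one.
            for (int i = scanned; i < m.Index; i++)
                if (text[i] == '\n') line++;
            scanned = m.Index;

            // The whole line runs from the match start to the next newline.
            int end = text.IndexOf('\n', m.Index);
            if (end < 0) end = text.Length;
            hits.Add(new KeyValuePair<int, string>(
                line, text.Substring(m.Index, end - m.Index)));
        }
        return hits;
    }

    public static void Main()
    {
        // Hypothetical sample with an empty third line.
        string text = "alpha beta\n\t kappa pi\n\ngamma\nupsilon xi";
        foreach (var hit in Find(text))
            Console.WriteLine("{0}: {1}", hit.Key, hit.Value.Trim());
    }
}
```

Because the text is scanned only once for newlines, the extra cost stays linear in the length of the document.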

upd
how long would this code take compared to the above?


  line_num = 1

  for line in fStrArr do
  (
      -- strip leading spaces and tabs first, otherwise the patterns cannot match indented lines
      line = trimLeft line " \t"

      case of
      (
          (matchpattern line pattern:"kappa*")   : append kappaArr line_num
          (matchpattern line pattern:"omicron*") : append omicronArr line
          (matchpattern line pattern:"upsilon*") : append upsilonArr line
      )

      line_num += 1
  )

#5

Your code: 0.142 sec
Mine: 0.280 sec

Update:

Using your approach, the time drops from 0.8 sec to 0.65 sec.

Searching for a single word in 50000 lines, the time went down from 2.62 sec to 2.46 sec (for 4407 occurrences).

For comparison, in Python the same search takes 0.088 sec for 4407 occurrences.

Still not as fast as I need. :slight_smile:


#6

look here, I’ve ported the code to do a quick test
https://dotnetfiddle.net/VLE9LB

pure c# version should do the job

Multiline:
5652
Time:293 ms. – could be improved with RegexOptions.Compiled option. ~200ms best time

Line by line:
5652
Time:65 ms.

After looking at the input string I realized that the pattern must be different and include "^\\s*..." at the beginning, to match lines that start with whitespace
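A small caveat about that prefix, sketched in C# with made-up names (LeadingWhitespaceDemo, FirstMatchIndex): in multiline mode \s also matches "\n", so a match can begin on an empty line above the keyword. That does not matter when lines are tested one by one, but it shifts Match.Index if the index is later used to work out a line number. Using [ \t]* instead keeps the match on one line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingWhitespaceDemo
{
    // Index at which the first match starts, or -1 when nothing matches.
    public static int FirstMatchIndex(string pattern, string text)
    {
        Match m = Regex.Match(text, pattern, RegexOptions.Multiline);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        // An empty first line followed by an indented "kappa" line.
        string text = "\n\t kappa pi";

        // \s also matches '\n', so the match starts on the empty line.
        Console.WriteLine(FirstMatchIndex(@"^\s*(kappa)\b", text));     // 0
        // [ \t] stays on one line, so the match starts on the kappa line.
        Console.WriteLine(FirstMatchIndex(@"^[ \t]*(kappa)\b", text));  // 1
    }
}
```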


#7

The C# code does not collect any data.
Does it know which word is found on the processed line?

RegEx in maxscript, in the way I use it, is not a solution :slight_smile:

(
	gc()
	newLineStr = "\n"
	newTabStr = "\t "
	trimLeftStr = " \t"
	--( 	generate string
	space = " "
	nl = "\n"
	tab = "\t"
	tabNl = "\t\n"
	fmt = "%\n"
	
	wordsArr = #("alpha", "beta", "gama", "delta", "Epsilon", "Zeta", "Eta", "Theta", "Iota", "kaPPa", "LamBda", "mU", "Nu", "xi", "omicron", "pi", "rHo", "siGma", "Tau", "UpSiLoN", "pHi", "chi", "psi", "omega")
	seed 123
	ss = stringStream ""
	for i = 1 to 50000 do
	(
		wordsCnt = random 10 20
		wArr = for j = 1 to wordsCnt collect (wordsArr[random 1 24])
		tabCnt = random 0 5
		
		str = ""
		if mod i 17 == 0 then
			str = tabNl
		else
		(
			if mod i 33 == 0 then
				str = nl
			else
			(
				for t = 1 to tabCnt do str += tab
				
				for w in wArr do str += space + w
			)
		)
		
		format fmt str to:ss
	)
	--)
	
	
	strToCheck = toLower (ss as string)
		
	gc()
	t0 = timestamp()
	
	fStrArr = (filterString strToCheck newLineStr)
	
	t1 = timestamp()
	format "filterString %  sec.\n" ((t1-t0)/1000.0)
	
	format "Lines: %\n" fStrArr.count
	
	
	_kappa = "kappa"
	_omicron = "omicron"
	_upsilon = "upsilon"	
	wordToFindArr = #(_kappa, _omicron, _upsilon)	
	
	kappaArr = #()
	omicronArr = #()
	upsilonArr = #()
	
	RE_Match   = (dotnetclass "system.text.regularexpressions.regex").match
	RE_Pattern_Omicron = "^\\s*(omicron)\\b"
	RE_Pattern_Kappa = "^\\s*(kappa)\\b"
	RE_Pattern_Upsilon = "^\\s*(upsilon)\\b"
	
	gc()
	t0 = timestamp()
	
	j = 1
	for str1 in fStrArr do
	(
		if (RE_Match str1 RE_Pattern_Kappa).Success then
		(
			append kappaArr j
		)
		else
		(
			if (RE_Match str1 RE_Pattern_Omicron).Success then
			(
				append omicronArr j
			)
			else
			(
				if (RE_Match str1 RE_Pattern_Upsilon).Success do
				(
					append upsilonArr j
				)
			)
		)
		j += 1
	)
	
	t1 = timestamp()
	format "Find words %  sec.\n" ((t1-t0)/1000.0)
	
	format "kappaArr: %\n" kappaArr.count
	format "omicronArr: %\n" omicronArr.count
	format "upsilonArr: %\n" upsilonArr.count
	
)

0.9 sec vs 0.14 sec for matchPattern.
:slight_smile:


#8

Hard to help you without 3dsmax running, but here’s a slightly optimized c# version, 40-50ms
Try porting it to mxs, it shouldn’t be complicated
https://dotnetfiddle.net/zrQ5r3

add the regex options from the c# source to your mxs code, it should improve the performance
combine them with dotnet.combineenums and you’re good to go
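On the C# side the options are combined with a bitwise OR, which is the job dotnet.combineEnums does from MAXScript. A minimal sketch, assuming the three keywords from the thread and the three options that appear in the C# snippet of post #10; the names RegexOptionsDemo and Keyword are made up. The Regex object is built once and reused for every line, and whether RegexOptions.Compiled pays off depends on how many lines are processed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Built once and reused for every line; the options are OR-ed together.
    static readonly Regex StartsWithKeyword = new Regex(
        @"^\s*(kappa|omicron|upsilon)\b",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Returns the lower-cased keyword a line starts with, or null if none.
    public static string Keyword(string line)
    {
        Match m = StartsWithKeyword.Match(line);
        return m.Success ? m.Groups[1].Value.ToLowerInvariant() : null;
    }

    public static void Main()
    {
        Console.WriteLine(Keyword("\t\t kaPPa alpha beta"));
        Console.WriteLine(Keyword(" alpha kappa") == null);
    }
}
```

With IgnoreCase in place the input would not need to be lower-cased beforehand, and a single alternation replaces three separate pattern checks per line.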


#9

If you want performance, try moving the loop iteration to c#
Make a dynamically compiled c# dll with a class and a method that takes a string or file path as an input parameter and fills the arrays with the data
after processing is complete you can access the arrays to get the values into maxscript
something like this

using System.Collections.Generic;

public class DocProcessor
	{
		List<int> _kappa = new List<int>();
		List<string> _omicron = new List<string>();
		List<string> _upsilon = new List<string>();
		
		public int[] kappa { 
			get 
			{ 
				return _kappa.ToArray(); 
			}
		}
		
		public string[] omicron { 
			get 
			{ 
				return _omicron.ToArray(); 
			}
		}
		
		public string[] upsilon { 
			get 
			{ 
				return _upsilon.ToArray(); 
			}
		}
		
	
		public void ProcessDocument( string doc )
		{
			/* pseudocode
			// clear all the lists before the start
			
			line_index = 0
			for each line in doc
				new_line = trim spaces from the beginning

				if kappa is match _kappa.add( line_index )
				else
				if omicron is match _omicron.add( new_line )
				else
				if upsilon is match _upsilon.add( new_line )

				line_index++
			*/
		}
	}

then in mxs

(
dp = dotnetobject "DocProcessor"
dp.ProcessDocument doc_string
kp = dp.kappa
print kp.count
)

not tested


#10

Thank you.
With my zero C# knowledge, this does not compile at all:

(
	fn CreateArrAssembly =
	(

		source = ""
		source += "using System;"
		source += "using Text;"
		source += "using System.Collections.Generic;"
		source += "System.Text.RegularExpressions;"
		source += "	public class DocProcessor"
		source += "	{"
		source += "		List<string> _kappa = new List<string>();"
		source += "		List<string> _omicron = new List<string>();"
		source += "		List<string> _upsilon = new List<string>();"
		
		source += "		List<int> _kappaIdx = new List<int>();"
		source += "		List<int> _omicronIdx = new List<int>();"
		source += "		List<int> _upsilonIdx = new List<int>();"

		source += "		public string[] kappa {"
		source += "			get"
		source += "			{"
		source += "				return _kappa.ToArray(); "
		source += "			}"
		source += "		}"

		source += "		public string[] omicron {"
		source += "			get"
		source += "			{"
		source += "				return _omicron.ToArray(); "
		source += "			}"
		source += "		}"

		source += "		public string[] upsilon {"
		source += "			get"
		source += "			{"
		source += "				return _upsilon.ToArray(); "
		source += "			}"
		source += "		}"
		
		source += "		public int[] kappaIdx {"
		source += "			get"
		source += "			{"
		source += "				return _kappaIdx .ToArray(); "
		source += "			}"
		source += "		}"

		source += "		public int[] omicronIdx  {"
		source += "			get"
		source += "			{"
		source += "				return _omicronIdx .ToArray(); "
		source += "			}"
		source += "		}"

		source += "		public int[] upsilonIdx  {"
		source += "			get"
		source += "			{"
		source += "				return _upsilonIdx .ToArray(); "
		source += "			}"
		source += "		}"

		source += "		var reOmicron = new Regex( \"^omicron\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );"
		source += "		var reKappa = new Regex( \"^kappa\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );"
		source += "		var reUpsilon = new Regex( \"^upsilon\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );"
		
		source += "		var spaces = new char[]{' ','	'};"

		source += "		public void ProcessDocument( string doc )"
		source += "		{"
		source += "			line_index = 0;"
		source += "			for each line in doc"
		source += "			{"
		source += "				line_index++;"
		source += "				lline = line.TrimStart( spaces );"
		source += "				if ( reOmicron.IsMatch( lline ) )"
		source += "				{"
		source += "					_omicron.add( line );"
		source += "					_omicronIdx.add( line_index );"
		source += "				}"
		source += "				else"
		source += "				if ( reKappa.IsMatch( lline ) )"
		source += "				{"
		source += "					_kappa.add( line );"
		source += "					_kappaIdx.add( line_index );"
		source += "				}"
		source += "				else"
		source += "				if ( reUpsilon.IsMatch( lline ) )"
		source += "				{"
		source += "					_upsilon.add( line);"
		source += "					_upsilonIdx.add( line_index );"
		source += "				}"
		source += "			}"
		source += "		}"
		source += "	}"

		csharpProvider = dotnetobject "Microsoft.CSharp.CSharpCodeProvider"
		compilerParams = dotnetobject "System.CodeDom.Compiler.CompilerParameters"

		compilerParams.ReferencedAssemblies.AddRange #("System.dll")

		compilerParams.GenerateInMemory = on
		compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #(source)
		assembly = compilerResults.CompiledAssembly
		assembly.CreateInstance "DocProcessor"
	)

	global DocProcessorA = CreateArrAssembly()

)

Playing with your code (https://dotnetfiddle.net/vshfOl) :slight_smile:


#11

no surprise, in order to iterate using foreach you need to split the doc into an array of lines
choose something from this SO thread

alternatively you could simply read the doc line by line and do what you want. this should be even faster

using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}
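A hedged sketch of how the StringReader loop could fill the lists from the DocProcessor outline in post #9. DocProcessorSketch, StartsWithWord and the property names are assumptions for illustration, not the code behind the dotnetfiddle links. Plain StartsWith checks are swapped in for the regexes, so a keyword counts only when it is followed by whitespace or the end of the line. Line numbers are 1-based and count empty lines. The syntax is kept to older C# features, which, as far as I know, is what CSharpCodeProvider accepts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class DocProcessorSketch
{
    readonly List<int> _kappa = new List<int>();
    readonly List<string> _omicron = new List<string>();
    readonly List<string> _upsilon = new List<string>();

    public int[] Kappa { get { return _kappa.ToArray(); } }
    public string[] Omicron { get { return _omicron.ToArray(); } }
    public string[] Upsilon { get { return _upsilon.ToArray(); } }

    // True when 'line' starts with 'word' followed by whitespace or line end.
    static bool StartsWithWord(string line, string word)
    {
        if (!line.StartsWith(word, StringComparison.OrdinalIgnoreCase)) return false;
        return line.Length == word.Length || char.IsWhiteSpace(line[word.Length]);
    }

    public void ProcessDocument(string doc)
    {
        // Clear results of a previous run.
        _kappa.Clear(); _omicron.Clear(); _upsilon.Clear();

        int lineNumber = 0;
        using (var reader = new StringReader(doc))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;   // 1-based, empty lines are counted too
                string trimmed = line.TrimStart(' ', '\t');
                if (StartsWithWord(trimmed, "kappa")) _kappa.Add(lineNumber);
                else if (StartsWithWord(trimmed, "omicron")) _omicron.Add(trimmed);
                else if (StartsWithWord(trimmed, "upsilon")) _upsilon.Add(trimmed);
            }
        }
    }

    public static void Main()
    {
        var dp = new DocProcessorSketch();
        dp.ProcessDocument("alpha\n\t kappa pi\n omicron xi\nUpsilon\nkappas");
        Console.WriteLine(dp.Kappa.Length + " " + dp.Omicron.Length + " " + dp.Upsilon.Length);
    }
}
```

StringReader and StreamReader both derive from TextReader, so the same loop should work on a file by swapping the reader.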

#12

I executed only the code I posted above, and there is an error:

-- Runtime error: .NET runtime exception: Could not load file or assembly 'file:///C:\Users\XXXXXX\AppData\Local\Temp\xi25bkge.dll' or one of its dependencies. The system cannot find the file specified.

If the C# syntax is correct, it should build the dll and load it, and if there is an error in the code, it should throw an error when I use it. Or am I wrong?


#13

The thing is that it shouldn’t compile at all unless you commented out everything inside the ProcessDocument method.
This one should compile: https://dotnetfiddle.net/7kNr05

add this to your code and see what errors it prints

...
        compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #( source )

		if (compilerResults.Errors.Count > 0 ) then
		(
			local errs = stringstream ""
			for i = 0 to (compilerResults.Errors.Count-1) do
			(
				local err = compilerResults.Errors.Item[i]
				format "Error:% Line:% Column:% %\n" err.ErrorNumber err.Line err.Column err.ErrorText to:errs
			)
			format "%\n" errs
			undefined
		)
		else
		(
			compilerResults.CompiledAssembly.CreateInstance ". . ."			
		)

#14

This is the end of the code:

source += "	}"

		csharpProvider = dotnetobject "Microsoft.CSharp.CSharpCodeProvider"
		compilerParams = dotnetobject "System.CodeDom.Compiler.CompilerParameters"

		compilerParams.ReferencedAssemblies.AddRange #("System.dll")

		compilerParams.GenerateInMemory = on
		compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #(source)
		
		if (compilerResults.Errors.Count > 0 ) then
		(
			local errs = stringstream ""
			for i = 0 to (compilerResults.Errors.Count-1) do
			(
				local err = compilerResults.Errors.Item[i]
				format "Error:% Line:% Column:% %\n" err.ErrorNumber err.Line err.Column err.ErrorText to:errs
			)
			format "%\n" errs
			undefined
		)
		else
		(
			compilerResults.CompiledAssembly.CreateInstance ". . ."			
		)
		
		assembly = compilerResults.CompiledAssembly
		assembly.CreateInstance "DocProcessor"
	)

And this is the error message:


StringStream:"Error:CS0116 Line:1 Column:58 A namespace cannot directly contain members such as fields or methods
"
-- Error occurred in anonymous codeblock; filename: C:\Temp\cSharp_Test_01.ms; position: 3890; line: 120
-- Runtime error: .NET runtime exception: Could not load file or assembly 'file:///C:\Users\XXXX\AppData\Local\Temp\c4f51gay.dll' or one of its dependencies. The system cannot find the file specified.
-- MAXScript callstack:
--	thread data: threadID:2704
--	------------------------------------------------------
--	[stack level: 0]
--	In CreateArrAssembly(); filename: C:\Temp\cSharp_Test_01.ms; position: 3891; line: 120
--	member of: anonymous codeblock
--		Locals:
--			compilerParams: dotNetObject:System.CodeDom.Compiler.CompilerParameters
--			compilerResults: dotNetObject:System.CodeDom.Compiler.CompilerResults
--			source: "using System;using Text;using System.Collections.Generic;System.Text.RegularExpressions;	public class DocProcessor	{		List<string> _kappa = new List<string>();		List<string> _omicron = new List<string>();		List<string> _upsilon = new List<string>();		List<int> _kappaIdx = new List<int>();		List<int> _omicronIdx = new List<int>();		List<int> _upsilonIdx = new List<int>();		public string[] kappa {			get			{				return _kappa.ToArray(); 			}		}		public string[] omicron {			get			{				return _omicron.ToArray(); 			}		}		public string[] upsilon {			get			{				return _upsilon.ToArray(); 			}		}		public int[] kappaIdx {			get			{				return _kappaIdx .ToArray(); 			}		}		public int[] omicronIdx  {			get			{				return _omicronIdx .ToArray(); 			}		}		public int[] upsilonIdx  {			get			{				return _upsilonIdx .ToArray(); 			}		}		var reOmicron = new Regex( "^omicron\b", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );		var reKappa = new Regex( "^kappa\b", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );		var reUpsilon = new Regex( "^upsilon\b", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );		var spaces = new char[]{' ','	'};		public void ProcessDocument( string doc )		{			line_index = 0;			for each line in doc			{				line_index++;				lline = line.TrimStart( spaces );				if ( reOmicron.IsMatch( lline ) )				{					_omicron.add( line );					_omicronIdx.add( line_index );				}				else				if ( reKappa.IsMatch( lline ) )				{					_kappa.add( line );					_kappaIdx.add( line_index );				}				else				if ( reUpsilon.IsMatch( lline ) )				{					_upsilon.add( line);					_upsilonIdx.add( line_index );				}			}		}	}"
--			errs: StringStream:"Error:CS0116 Line:1 Column:58 A namespace cannot directly contain members such as fields or methods
"
--			csharpProvider: dotNetObject:Microsoft.CSharp.CSharpCodeProvider
--			assembly: undefined
--		Externals:
--			owner: <CodeBlock:anonymous>
--	------------------------------------------------------
--	[stack level: 1]
--	called from anonymous codeblock; filename: C:\Temp\cSharp_Test_01.ms; position: 3984; line: 124
--		Locals:
--			CreateArrAssembly: CreateArrAssembly()
--		Externals:
--	------------------------------------------------------
--	[stack level: 2]
--	called from top-level


#15

StringStream:"Error:CS0116 Line:1 Column:58 A namespace cannot directly contain members such as fields or methods

you can always read the reason behind the error on msdn.

source += "System.Text.RegularExpressions;"

should be

source += "using System.Text.RegularExpressions;"


#16

Thank you.
I updated the code as you suggested. Then there were errors saying that ‘var’ can’t be used, so I removed the var from the code, and now I have this:

StringStream:"Error:CS1519 Line:1 Column:848 Invalid token '=' in class, struct, or interface member declaration
Error:CS1520 Line:1 Column:854 Method must have a return type
Error:CS1031 Line:1 Column:861 Type expected
Error:CS1519 Line:1 Column:905 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:931 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:955 Invalid token ')' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:967 Invalid token '=' in class, struct, or interface member declaration
Error:CS1520 Line:1 Column:973 Method must have a return type
Error:CS1031 Line:1 Column:980 Type expected
Error:CS1519 Line:1 Column:1022 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1048 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1072 Invalid token ')' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1086 Invalid token '=' in class, struct, or interface member declaration
Error:CS1520 Line:1 Column:1092 Method must have a return type
Error:CS1031 Line:1 Column:1099 Type expected
Error:CS1519 Line:1 Column:1143 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1169 Invalid token '|' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1193 Invalid token ')' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1204 Invalid token '=' in class, struct, or interface member declaration
Error:CS1519 Line:1 Column:1216 Invalid token '{' in class, struct, or interface member declaration
Error:CS1518 Line:1 Column:1235 Expected class, delegate, enum, interface, or struct
Error:CS1022 Line:1 Column:1712 Type or namespace definition, or end-of-file expected
"

This ‘Column: 1235’ shows where the error is, right? But how do I find that position in the MAXScript editor (if it is possible)?


#17

just put all of your c# code into dotnetfiddle and try to run it, it will show you the places where the syntax isn’t correct
and btw, why not use a single multiline string in mxs for the c# source code?

upd.
and to answer your question
you can’t declare variables with var at the class level, move them into the ProcessDocument method or declare them with an explicit type
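A minimal sketch of the explicit-type variant, with hypothetical names (FieldDeclarationDemo, StartsWithKappa). One more thing worth checking, judging by the source dump in post #14: the generated C# contains "^kappa\b" as a regular string literal, and in a regular C# string \b is the backspace escape, not a regex word boundary. A verbatim @"..." string, or a doubled backslash in the generated source, avoids that.

```csharp
using System;
using System.Text.RegularExpressions;

public class FieldDeclarationDemo
{
    // 'var' is only allowed for local variables, so fields name their type.
    // The @ prefix keeps \b as a regex word boundary instead of a backspace.
    static readonly Regex ReKappa = new Regex(
        @"^kappa\b",
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    static readonly char[] Spaces = new char[] { ' ', '\t' };

    public static bool StartsWithKappa(string line)
    {
        // Inside a method 'var' is fine.
        var trimmed = line.TrimStart(Spaces);
        return ReKappa.IsMatch(trimmed);
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithKappa("\t KaPPa alpha"));  // True
        Console.WriteLine(StartsWithKappa("alpha kappa"));     // False
    }
}
```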



#18

No errors:

https://dotnetfiddle.net/FQKKLD


#19

I’ve posted updated version above, here’s the link: https://dotnetfiddle.net/7kNr05
you don’t really have to split it to collect the data


#20

Thank you.
Can this version work if I send a string, not a txt file?