Fast search through text file


#21

The one I posted? Yes, it works only for strings.
If you want to read the text from the file line by line, here’s the example from MSDN


#22

Thank you.
Finally, no errors when compiling.
Here are the timings for the code from my first post, followed by the C# version:

Lines: 48574

Find words 0.294 sec.
kappaArr: 1866
omicronArr: 0
upsilonArr: 0

Find words 0.037 sec.
: 0

Strange: it returns 0 kappa matches, while there are 1866.


#23

If the input data is the same in both cases, then the problem is somewhere in the C# code.


#24

The input data is the strToCheck, generated in the maxscript. I pass it without any modifications.

dp = dotnetobject "DocProcessor"
	dp.ProcessDocument strToCheck
	kp = dp.kappa
	format ": %\n"  kp.count

If I change the C# code to check for the first word on the first line (for example, “phi”), the returned result is still always 0.
Can we force the C# code to print directly to the MaxScript listener? :slight_smile:

Update:

Even if I remove all whitespace at the beginning of the lines, the result is 0.


#25

Did you try the link I posted above? It worked well, and the lists weren’t empty.

Yes, you can print to the listener, but you’ll have to add Autodesk.Max.dll as an assembly reference (the DLL from the Max version where you run the script)
and call
Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Wputs followed by Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Flush
or
Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Printf


#26

could you post a test case and the current and desired numbers?


#27

As I understand it, the task is to find all occurrences of a specified substring in a large text file (~50k lines) and get the line and position of each one, right?


#28

I’m pretty sure finding all occurrences is very fast if you only get the positions. So I would first find the positions of all the line ends and then the positions of the substring. After that it’s quick and easy to find the line each substring match is on.
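The idea above can be sketched in plain Python (outside MaxScript). This is only an illustration under assumptions: the helper name `find_with_lines` and the sample text are made up, and the line lookup uses a binary search over the sorted newline offsets, which is one possible way to do the "quick and easy" last step.

```python
import re
from bisect import bisect_right

def find_with_lines(text, word):
    """Return (line_number, column, offset) for every occurrence of word.

    Hypothetical helper: line numbers are 1-based, columns and offsets 0-based.
    """
    # Offsets of every newline character, already in ascending order.
    newline_offsets = [m.start() for m in re.finditer("\n", text)]
    hits = []
    for m in re.finditer(re.escape(word), text):
        pos = m.start()
        # The number of newlines before pos is the zero-based line index.
        line = bisect_right(newline_offsets, pos)
        line_start = newline_offsets[line - 1] + 1 if line else 0
        hits.append((line + 1, pos - line_start, pos))
    return hits

sample = "alpha kappa\n\tbeta\n  kappa kappa\n"
hits = find_with_lines(sample, "kappa")
print(hits)
```

Each lookup costs a binary search instead of a rescan of the text, so the total work stays close to the two `finditer` passes.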


#29

re = python.import "re"
(
	t0 = timestamp()
	h0 = heapfree

	n = (re.finditer "\n" strToCheck)

	k = (re.finditer "kappa" strToCheck)
	o = (re.finditer "omicron" strToCheck)
	u = (re.finditer "upsilon" strToCheck)

	format "time:% heap:%\n" (timestamp() - t0) (h0 - heapfree)
)

--time:9 heap:1008L

Python is very well suited to string processing, so these are the numbers we can target.


#30

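OK, I managed to test the code in Max, and it turns out the zero-match issue was in the regex patterns: the word boundary needed another couple of backslashes. The C# source sits inside a MaxScript string literal, so each backslash is unescaped twice (once by MaxScript, once by the C# compiler) before the regex engine sees \b.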

var reOmicron = new Regex( \"^omicron\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
var reKappa   = new Regex( \"^kappa\\\\b\",   RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
var reUpsilon = new Regex( \"^upsilon\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
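
For anyone puzzled by the escaping, here is a small Python sketch of the same kind of pitfall. It is an analogy with a single escaping layer, not the MaxScript/C# code itself: in an ordinary string literal `"\b"` is the backspace character, so the pattern silently stops matching. Presumably that is what the C# compiler saw before the extra backslashes were added.

```python
import re

line = "kappa alpha beta"

# In an ordinary literal "\b" is the backspace character (0x08),
# so this pattern looks for "kappa" followed by a backspace.
broken = re.compile("^kappa\b", re.IGNORECASE)

# A raw string (or "\\b") hands backslash + b to the regex engine,
# which then treats it as a word boundary.
fixed = re.compile(r"^kappa\b", re.IGNORECASE)

broken_hit = broken.search(line) is not None
fixed_hit = fixed.search(line) is not None
print(broken_hit, fixed_hit)
```

With two string layers, as in the MaxScript-embedded C# source, each layer consumes one level of backslashes, which is why the pattern there ends up with four.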

code:

(
	::dp_assembly = (
	local source = "using System;
	using System.Text;
	using System.Collections.Generic;
	using System.Text.RegularExpressions;

	public class DocProcessor
	{
		List<int> _kappa = new List<int>();
		List<string> _omicron = new List<string>();
		List<string> _upsilon = new List<string>();
		
		public int[] kappa { 
			get 
			{ 
				return _kappa.ToArray(); 
			}
		}
		
		public string[] omicron { 
			get 
			{ 
				return _omicron.ToArray(); 
			}
		}
		
		public string[] upsilon { 
			get 
			{ 
				return _upsilon.ToArray(); 
			}
		}

		public void ProcessDocument( string doc )
		{			
			_kappa.Clear();
			_omicron.Clear();
			_upsilon.Clear();
			
			var spaces = new char[]{' ','	'};
		
			var reOmicron = new Regex( \"^omicron\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
			var reKappa   = new Regex( \"^kappa\\\\b\",   RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			var reUpsilon = new Regex( \"^upsilon\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			
			using (System.IO.StringReader sr = new System.IO.StringReader(doc))
			{
				int index = 0;
				string line;
				while ((line = sr.ReadLine()) != null)
				{
					var lline = line.TrimStart( spaces );
			
					if ( reOmicron.IsMatch( lline ) )
					{		
						_omicron.Add(line);						
					}
					else			
					if ( reKappa.IsMatch( lline ) )
					{
						_kappa.Add( index );
					}
					else
					if ( reUpsilon.IsMatch( lline ) )
					{
						_upsilon.Add( line );
					}
					
					index++;
				}
			}
		}
	}"

		csharpProvider = dotnetobject "Microsoft.CSharp.CSharpCodeProvider"
		compilerParams = dotnetobject "System.CodeDom.Compiler.CompilerParameters"
		compilerParams.ReferencedAssemblies.Add("System.dll");
		compilerParams.ReferencedAssemblies.Add("System.Windows.Forms.dll");

		compilerParams.GenerateInMemory = on
		compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #(source)


		if (compilerResults.Errors.Count > 0 ) then
		(
			local errs = stringstream ""
			for i = 0 to (compilerResults.Errors.Count-1) do
			(
				local err = compilerResults.Errors.Item[i]
				format "Error:% Line:% Column:% %\n" err.ErrorNumber err.Line err.Column err.ErrorText to:errs
			)
			format "%\n" errs
			undefined
		)
		else
		(
			compilerResults.CompiledAssembly		
		)

	)

	gc()
	newLineStr = "\n"
	newTabStr = "\t "	
	--( 	generate string
	space = " "
	nl = "\n"
	tab = "\t"
	tabNl = "\t\n"
	fmt = "%\n"
	
	wordsArr = #("alpha", "beta", "gama", "delta", "Epsilon", "Zeta", "Eta", "Theta", "Iota", "kaPPa", "LamBda", "mU", "Nu", "xi", "omicron", "pi", "rHo", "siGma", "Tau", "UpSiLoN", "pHi", "chi", "psi", "omega")
	ss = stringStream ""
	
	seed 12345
	for i = 1 to 50000 do
	(
		wordsCnt = random 10 20
		wArr = for j = 1 to wordsCnt collect (wordsArr[random 1 24])
		tabCnt = random 0 5
		
		str = ""
		if mod i 17 == 0 then
			str = tabNl
		else
		(
			if mod i 33 == 0 then
				str = nl
			else
			(
				for t = 1 to tabCnt do str += tab
				
				for w in wArr do str += space + w
			)
		)
		
		format fmt str to:ss
	)
	--)
	
	
	strToCheck = toLower (ss as string)
		

	gc()
	t0 = timestamp()
	
	

	/*
	ss = strToCheck as stringstream 
	seek ss 0
	while not eof ss do
	(
		ln = trimLeft (readline ss) " 	"
		
		if MatchPattern ln pattern:"kappa*"   then count += 1 else
		if MatchPattern ln pattern:"omicron*" then count += 1 else
		if MatchPattern ln pattern:"upsilon*" do count += 1		
	)
	*/
	
	dp = (dotNetClass "System.Activator").CreateInstance (dp_assembly.GetType("DocProcessor"))
	dp.ProcessDocument strToCheck
	
	t1 = timestamp()
	format "Find words %  sec.\n" ((t1-t0)/1000.0)
	
	format "kappaArr: %\n" dp.kappa.count
	format "omicronArr: %\n" dp.omicron.count
	format "upsilonArr: %\n" dp.upsilon.count
)

#31

looks very promising. how do you convert these iterators to a mxs array?


#32

There might be a smarter way, but I use the first that comes to mind:

cmd = python.import "__builtin__"
cmd.list k as array
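
Two side notes, with a plain-Python sketch (run outside MaxScript, sample text made up). First, `__builtin__` is the Python 2 module name; under Python 3 it is `builtins`. Second, `re.finditer` is lazy: it only creates an iterator, and the actual scan happens when the iterator is consumed, so a timing that stops before the conversion does not include the search. If only positions are needed, keeping `m.start()` instead of the match objects gives a flat list of integers.

```python
import re

text = "kappa alpha\n\tkappa beta kappa\n"

# finditer only builds the iterator here; nothing has been scanned yet.
matches = re.finditer("kappa", text)

# Consuming the iterator performs the search. Storing m.start() keeps
# plain integers rather than one match object per hit.
kappa_offsets = [m.start() for m in matches]
print(kappa_offsets)

# The iterator is exhausted after one pass.
leftover = list(matches)
```

Whether a list of integers crosses into MXS more cheaply than a list of match objects is an assumption that would need to be measured in Max.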

#33

Compared to the wall of text needed for C#, Python is definitely the winner :slight_smile:


#34

Maybe (and very likely) we will have a memory issue with Python in MXS.


#35

Serejah, Denis, Thank you. :slight_smile:

Using this:

(
	::dp_assembly = (
	local source = "using System;
	using System.Text;
	using System.Collections.Generic;
	using System.Text.RegularExpressions;

	public class DocProcessor
	{
		List<int> _kappa = new List<int>();
		List<string> _omicron = new List<string>();
		List<string> _upsilon = new List<string>();
		
		public int[] kappa { 
			get 
			{ 
				return _kappa.ToArray(); 
			}
		}
		
		public string[] omicron { 
			get 
			{ 
				return _omicron.ToArray(); 
			}
		}
		
		public string[] upsilon { 
			get 
			{ 
				return _upsilon.ToArray(); 
			}
		}

		public void ProcessDocument( string doc )
		{			
			_kappa.Clear();
			_omicron.Clear();
			_upsilon.Clear();
			
			var spaces = new char[]{' ','	'};
		
			var reOmicron = new Regex( \"^omicron\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
			var reKappa   = new Regex( \"^kappa\\\\b\",   RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			var reUpsilon = new Regex( \"^upsilon\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			
			using (System.IO.StringReader sr = new System.IO.StringReader(doc))
			{
				int index = 0;
				string line;
				while ((line = sr.ReadLine()) != null)
				{
					var lline = line.TrimStart( spaces );
			
					if ( reOmicron.IsMatch( lline ) )
					{		
						_omicron.Add(line);						
					}
					else			
					if ( reKappa.IsMatch( lline ) )
					{
						_kappa.Add( index );
					}
					else
					if ( reUpsilon.IsMatch( lline ) )
					{
						_upsilon.Add( line );
					}
					
					index++;
				}
			}
		}
	}"

		csharpProvider = dotnetobject "Microsoft.CSharp.CSharpCodeProvider"
		compilerParams = dotnetobject "System.CodeDom.Compiler.CompilerParameters"
		compilerParams.ReferencedAssemblies.Add("System.dll");
		compilerParams.ReferencedAssemblies.Add("System.Windows.Forms.dll");

		compilerParams.GenerateInMemory = on
		compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #(source)


		if (compilerResults.Errors.Count > 0 ) then
		(
			local errs = stringstream ""
			for i = 0 to (compilerResults.Errors.Count-1) do
			(
				local err = compilerResults.Errors.Item[i]
				format "Error:% Line:% Column:% %\n" err.ErrorNumber err.Line err.Column err.ErrorText to:errs
			)
			format "%\n" errs
			undefined
		)
		else
		(
			compilerResults.CompiledAssembly		
		)

	)
	
	re = python.import "re"	
	cmd = python.import "__builtin__"

	gc()
	newLineStr = "\n"
	newTabStr = "\t "	
	--( 	generate string
	space = " "
	nl = "\n"
	tab = "\t"
	tabNl = "\t\n"
	fmt = "%\n"
	
	wordsArr = #("alpha", "beta", "gama", "delta", "Epsilon", "Zeta", "Eta", "Theta", "Iota", "kaPPa", "LamBda", "mU", "Nu", "xi", "omicron", "pi", "rHo", "siGma", "Tau", "UpSiLoN", "pHi", "chi", "psi", "omega")
	ss = stringStream ""
	
	seed 12345
	for i = 1 to 50000 do
	(
		wordsCnt = random 10 20
		wArr = for j = 1 to wordsCnt collect (wordsArr[random 1 24])
		tabCnt = random 0 5
		
		str = ""
		if mod i 17 == 0 then
			str = tabNl
		else
		(
			if mod i 33 == 0 then
				str = nl
			else
			(
				for t = 1 to tabCnt do str += tab
				
				for w in wArr do str += space + w
			)
		)
		
		format fmt str to:ss
	)
	--)
	
	
	strToCheck = toLower (ss as string)
		

	gc()
	t0 = timestamp()
	h0 = heapfree
	
	dp = (dotNetClass "System.Activator").CreateInstance (dp_assembly.GetType("DocProcessor"))
	dp.ProcessDocument strToCheck
	
	t1 = timestamp()
	format "C# %  Heap: % \n" ((t1-t0)/1000.0) (h0 - heapfree)
	
	format "kappaArr: %\n" dp.kappa.count
	format "omicronArr: %\n" dp.omicron.count
	format "upsilonArr: %\n" dp.upsilon.count
	
	format "dp.kappa[1]: %\n" dp.kappa[1]
	
	
	gc()
	t0 = timestamp()
	h0 = heapfree
	
	n = (re.finditer "\n" strToCheck)

	k = (re.finditer "kappa" strToCheck)
	o = (re.finditer "omicron" strToCheck)
	u = (re.finditer "upsilon" strToCheck)
	
	kappaArr = cmd.list k as array
	omicronArr = cmd.list o as array
	upsilonArr = cmd.list u as array
			
	t1 = timestamp()
	format "Python %  Heap: % \n" ((t1-t0)/1000.0) (h0 - heapfree)
	
	format "kappaArr: %\n" kappaArr.count
	format "omicronArr: %\n" omicronArr.count
	format "upsilonArr: %\n" upsilonArr.count
	
	format "kappaArr[1]: %\n" kappaArr[1]
	
)

The times are:

C# time: 0.055 Heap: 852L
kappaArr: 1896
omicronArr: 1876
upsilonArr: 1900
dp.kappa[1]: 11

Python 0.183 Heap: 6498652L
kappaArr: 28094
omicronArr: 28527
upsilonArr: 28390
kappaArr[1]: <_sre.SRE_Match object at 0x0000023B5728E8B8>
OK

Denis, I hope I can use my Python code the same way you use Python in your example.
I have posted a thread on the Autodesk forum asking almost the same thing: how Python code can be executed inside MaxScript. Maybe your approach will let me convert my code into something usable. The main goal is to learn something new.

The C# time is pretty fast.
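
Note that the two runs above count different things: the C# code counts lines whose first word (after leading spaces and tabs) is the keyword, while `re.finditer "kappa"` counts every occurrence anywhere in the text, hence 1896 versus 28094. A rough plain-Python sketch of the line-start variant follows; the sample text is made up, and the pattern is one possible equivalent of the C# `TrimStart` plus `^kappa\b` combination, not a tested drop-in.

```python
import re

text = "kappa alpha\n\tbeta kappa\n  \tkappa kappa\nkappas\n"

# Every occurrence anywhere in the text (what the Python benchmark counted).
all_hits = [m.start() for m in re.finditer("kappa", text)]

# Only lines whose first word is "kappa": with re.MULTILINE, ^ anchors at
# each line start, [ \t]* skips the indentation, \b rejects "kappas".
line_start = re.compile(r"^[ \t]*kappa\b", re.MULTILINE | re.IGNORECASE)
first_word_hits = [m.start() for m in line_start.finditer(text)]

print(len(all_hits), len(first_word_hits))
```

Here `m.start()` is the offset of the line start (indentation included), which is convenient if the line number is looked up from the newline offsets afterwards.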


#36

Pure MXS dotnet is not bad either:

(
	t0 = timestamp()
	h0 = heapfree

	matches = (dotnetclass "System.Text.RegularExpressions.Regex").Matches strToCheck "kappa"
	k = for i=0 to matches.count-1 collect (matches.item i).index

	format " count:% == time:% heap:%\n" k.count (timestamp() - t0) (h0 - heapfree) 
)
 

#37

A well-known issue kills the performance: iterating over dotnet collections from MXS.


#38

If anyone knows how to write a regex that finds ----- (or any other number of - characters) at the beginning of a line, please help. :slight_smile:
I will try to find the answer tomorrow evening. Now it is time to go to bed.

Thank you one more time.


#39

As I said above, it’s very easy to get: just find the ends of all lines first.


#40

The regex syntax is pretty simple; still, I’d suggest reading a good book on how to write efficient expressions. It will make your life easier many, many times in the future.

^ - beginning of the line

(dotNetClass "system.text.regularexpressions.regex").isMatch "----"  "^[-]+"  -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "^[-]+"  -- false
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "[-]+"   -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "-+"     -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " +-+-" "-[-+]+" -- true
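
The checks above test one line at a time. When the whole text is searched in one go, `^` only matches at the very start of the string unless multiline mode is on (`re.MULTILINE` in Python; the .NET counterpart is `RegexOptions.Multiline`). A small Python sketch with made-up sample text:

```python
import re

text = "-----\n ---\nalpha - beta\n--\n"

# With re.MULTILINE, ^ matches at the start of every line.
dash_lines = re.compile(r"^-+", re.MULTILINE)
runs = [(m.start(), m.group()) for m in dash_lines.finditer(text)]

# Variant that tolerates leading spaces or tabs before the dashes.
indented = re.compile(r"^[ \t]*(-+)", re.MULTILINE)
indented_runs = [m.group(1) for m in indented.finditer(text)]

print(runs)
print(indented_runs)
```

The first pattern skips the indented " ---" line, the second one picks it up; neither matches the dash in the middle of "alpha - beta".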