George2

Hello everyone,


I have a large data set of integers (several million of them), all positive integers from an experiment. They are stored in a file on my local machine. I want to select the largest 1000 values and then do analysis based on those 1000 values.

Could anyone suggest some efficient solutions? I think sorting the several million values in memory is not feasible.


thanks in advance,
George



Re: Visual C++ Language find the largest 1000 values

Hammadi Dali


Hi
You can simply store the values in a vector and then sort them with:

Code Block

std::sort( vec.begin(), vec.end() );


Have a look at this link.
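
For example, something along these lines (just a sketch; the file name "data.txt" is only an example):

Code Block

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("data.txt");   // example file name
    std::vector<int> vec;

    int value;
    while (in >> value)             // read all values into memory
        vec.push_back(value);

    // sort in descending order so the largest values come first
    std::sort(vec.begin(), vec.end(), std::greater<int>());

    if (vec.size() > 1000)
        vec.resize(1000);           // keep only the 1000 largest

    for (std::size_t i = 0; i < vec.size(); ++i)
        std::cout << vec[i] << '\n';

    return 0;
}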

Regards
Hammadi




Re: Visual C++ Language find the largest 1000 values

Andreas Masur

Well... are these millions of numbers unique? If there are duplicates, what defines the first 1000 largest? Are duplicates eliminated?





Re: Visual C++ Language find the largest 1000 values

Pintu Shukla


In my opinion it's not a good approach to retrieve all the data from the file and then perform sorting, redundancy checks and so on; if there are 50 million values, think about how much time you would spend on that. It's better to keep the data in the file in some sorted form, and if you want you can use a data structure for this as well.

For example, you could use a linked list: whenever you insert a new element, keep the list sorted, and after performing your calculations write the list back to the file; or use a vector, etc.

Thanx





Re: Visual C++ Language find the largest 1000 values

Carl Daniel

Start with something simple, along the lines of:

Code Block

std::set<item> items;

for each item in pool_of_items
{
    items.insert(item);

    if (items.size() > 1000)
        items.erase(items.begin());   // drop the smallest element kept so far
}

and see if that gives acceptable performance for you. Only if it doesn't should you spend any more time looking for something better.
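
For reference, a fleshed-out version of that sketch might look like the following. The file name is only an example, and I have used std::multiset so that duplicate values are kept (a plain std::set would collapse them):

Code Block

#include <fstream>
#include <iostream>
#include <set>

int main()
{
    std::ifstream in("data.txt");   // example file name
    std::multiset<int> top;         // the largest values seen so far

    int value;
    while (in >> value)             // stream the file one value at a time
    {
        top.insert(value);
        if (top.size() > 1000)
            top.erase(top.begin()); // drop the current smallest
    }

    // 'top' now holds the 1000 largest values, in ascending order
    for (std::multiset<int>::const_iterator it = top.begin(); it != top.end(); ++it)
        std::cout << *it << '\n';

    return 0;
}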






Re: Visual C++ Language find the largest 1000 values

George2

Thanks Hammadi,

The physical memory is not enough to hold all the data; the data will grow to tens of GB in the end. Any new ideas?

Hammadi Dali wrote:

Hi
You can simply store the values in a vector and then sort them with:

Code Block

std::sort( vec.begin(), vec.end() );


Have a look at this link.

Regards
Hammadi

regards,

George





Re: Visual C++ Language find the largest 1000 values

George2

Hi Andreas,

Andreas Masur wrote:
Well... are these millions of numbers unique? If there are duplicates, what defines the first 1000 largest? Are duplicates eliminated?

Most of them are not unique. Actually, they are multidimensional data; I use integers in this discussion thread to simplify the question. Any ideas?

regards,

George




Re: Visual C++ Language find the largest 1000 values

Carl Daniel

George2 wrote:

The physical memory is not enough to hold all the data; the data will grow to tens of GB in the end. Any new ideas?

If you're that worried about the ultimate size of the data, then use a database. If a general purpose database is too much, consider one of the various B-Tree (B+Tree, B*Tree) libraries or "in process databases" that are available. One you should definitely consider is SQL Server Compact Edition:

http://www.microsoft.com/sql/editions/compact/default.mspx






Re: Visual C++ Language find the largest 1000 values

George2

Thanks Carl,

Carl Daniel wrote:

Start with something simple, along the lines of:

Code Block

std::set<item> items;

for each item in pool_of_items
{
    items.insert(item);

    if (items.size() > 1000)
        items.erase(items.begin());   // drop the smallest element kept so far
}

and see if that gives acceptable performance for you. Only if it doesn't should you spend any more time looking for something better.

If I understand correctly, you need to read all the data into a set, and in my situation the data footprint is very large and physical memory cannot hold it. Any ideas or comments?

regards,

George





Re: Visual C++ Language find the largest 1000 values

George2

Thanks Pintu,

If I understand your idea correctly, you mean reading all the data into memory from the file which stores the original data, right? In my situation, the size of the data is larger than the physical memory, so your solution is not suitable for me (correct me if I have misunderstood it). Do you have any new ideas or comments?

Pintu Shukla wrote:

In my opinion it's not a good approach to retrieve all the data from the file and then perform sorting, redundancy checks and so on; if there are 50 million values, think about how much time you would spend on that. It's better to keep the data in the file in some sorted form, and if you want you can use a data structure for this as well.

For example, you could use a linked list: whenever you insert a new element, keep the list sorted, and after performing your calculations write the list back to the file; or use a vector, etc.

Thanx

regards,

George





Re: Visual C++ Language find the largest 1000 values

Carl Daniel

If I understand correctly, you need to read all the data into a set, and in my situation the data footprint is very large and physical memory cannot hold it. Any ideas or comments?

No, the idea is to read the data one element (or a few) at a time, but only keep a std::set of the largest 1000 elements. The memory usage is bounded no matter how large the complete dataset is.
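
If the set overhead ever becomes an issue, the same bounded-memory idea can also be written as a min-heap with std::priority_queue. Just a sketch, and the file name is again only an example:

Code Block

#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main()
{
    std::ifstream in("data.txt");   // example file name

    // min-heap: top() is the smallest of the values kept so far
    std::priority_queue<int, std::vector<int>, std::greater<int> > heap;

    int value;
    while (in >> value)
    {
        if (heap.size() < 1000)
            heap.push(value);
        else if (value > heap.top())
        {
            heap.pop();             // drop the smallest kept value
            heap.push(value);
        }
    }

    // the heap now holds the 1000 largest values
    while (!heap.empty())
    {
        std::cout << heap.top() << '\n';   // printed smallest first
        heap.pop();
    }

    return 0;
}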






Re: Visual C++ Language find the largest 1000 values

George2

Thanks Carl,

I would have to learn SQL Server, and sometimes I think SQL Server is too complex to learn and maintain. I would appreciate it if you have other ideas which do not use SQL Server.

Carl Daniel wrote:

George2 wrote:

The physical memory is not enough to hold all the data; the data will grow to tens of GB in the end. Any new ideas?

If you're that worried about the ultimate size of the data, then use a database. If a general purpose database is too much, consider one of the various B-Tree (B+Tree, B*Tree) libraries or "in process databases" that are available. One you should definitely consider is SQL Server Compact Edition:

http://www.microsoft.com/sql/editions/compact/default.mspx

regards,

George





Re: Visual C++ Language find the largest 1000 values

George2

Sorry Carl, I misunderstood your original idea. Do you think making the set larger (e.g. 10,000 elements) would give better performance?

Carl Daniel wrote:

If I understand correctly, you need to read all the data into a set, and in my situation the data footprint is very large and physical memory cannot hold it. Any ideas or comments?

No, the idea is to read the data one element (or a few) at a time, but only keep a std::set of the largest 1000 elements. The memory usage is bounded no matter how large the complete dataset is.

regards,

George





Re: Visual C++ Language find the largest 1000 values

Ralf G

I think Carl meant to point out the expression:

if (items.size() > 1000)
    items.erase(items.begin());


This implies that you will never need to store more than 1000 elements in memory.
If you replace the "for each item in pool_of_items" with a loop that reads x lines from the file at a time, tries to insert those, and then reads the next x lines, it is definitely a solution; it just depends on the speed you need.

Another option would be to do Carl's suggestion in bulk:

Read x items (say 50,000, but try to find an optimal value for x) from the file and sort them (if I remember correctly, std::sort on a vector is fairly efficient). Copy the biggest 1000 into your result array.

Repeat this until end of file:
Read the next x items from the file, add the 1000 items from the result array, and sort again.
Copy the 1000 biggest back into the result array.

If that's not efficient enough, you might want to investigate other, faster sorting or selection algorithms.
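
A rough sketch of the bulk idea, so it is easier to see what I mean (the chunk size and file name are just examples):

Code Block

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t chunkSize = 50000;   // the "x" from the description above
    const std::size_t keep = 1000;

    std::ifstream in("data.txt");          // example file name
    std::vector<int> result;               // best 'keep' values seen so far

    while (in)
    {
        std::vector<int> chunk(result);    // carry the best-so-far into the chunk
        int value;
        while (chunk.size() < chunkSize + keep && in >> value)
            chunk.push_back(value);

        // sort descending and keep only the largest 'keep' values
        std::sort(chunk.begin(), chunk.end(), std::greater<int>());
        if (chunk.size() > keep)
            chunk.resize(keep);
        result.swap(chunk);
    }

    // 'result' now holds the largest 1000 values, in descending order
    for (std::size_t i = 0; i < result.size(); ++i)
        std::cout << result[i] << '\n';

    return 0;
}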





Re: Visual C++ Language find the largest 1000 values

George2

Thanks Ralf,

I understand your solutions. I have a new idea, which uses the STL algorithm std::nth_element, but I am not sure about its time complexity or how it works internally. Do you have any comments or ideas?
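
Roughly, what I have in mind for each chunk is something like this (just a sketch with made-up names; as far as I understand, std::nth_element only partitions the range and runs in linear time on average, but please correct me if I am wrong):

Code Block

#include <algorithm>
#include <functional>
#include <vector>

// Keep only the 'keep' largest values of 'chunk' (their order is unspecified).
void keepLargest(std::vector<int>& chunk, std::size_t keep)
{
    if (chunk.size() <= keep)
        return;

    // Partition so that the 'keep' largest values come first,
    // without fully sorting the chunk.
    std::nth_element(chunk.begin(), chunk.begin() + keep,
                     chunk.end(), std::greater<int>());
    chunk.resize(keep);
}

Each chunk read from the file, together with the 1000 values carried over from the previous chunk, would go through keepLargest, and only at the very end would I sort the final 1000 values for the analysis.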

Ralf G wrote:
I think Carl meant to point out the expression:

if (items.size() > 1000)
    items.erase(items.begin());


This implies that you will never need to store more than 1000 elements in memory.
If you replace the "for each item in pool_of_items" with a loop that reads x lines from the file at a time, tries to insert those, and then reads the next x lines, it is definitely a solution; it just depends on the speed you need.

Another option would be to do Carl's suggestion in bulk:

Read x items (say 50,000, but try to find an optimal value for x) from the file and sort them (if I remember correctly, std::sort on a vector is fairly efficient). Copy the biggest 1000 into your result array.

Repeat this until end of file:
Read the next x items from the file, add the 1000 items from the result array, and sort again.
Copy the 1000 biggest back into the result array.

If that's not efficient enough, you might want to investigate other, faster sorting or selection algorithms.

regards,

George